Currently (version 9.6), Postgres does not have any statistics about the internals of document types like `json`, `jsonb`, `xml` or `hstore`. (There has been discussion whether and how to change that.) Instead, the Postgres query planner uses constant default frequency estimates (like you observed).

However, there are separate statistics for functional indexes like your `idx_test_btree`. The manual has this tip for you:
> *Tip:* Although per-column tweaking of `ANALYZE` frequency might not be very productive, you might find it worthwhile to do per-column adjustment of the level of detail of the statistics collected by `ANALYZE`. Columns that are heavily used in `WHERE` clauses and have highly irregular data distributions might require a finer-grain data histogram than other columns. See `ALTER TABLE SET STATISTICS`, or change the database-wide default using the `default_statistics_target` configuration parameter.
>
> Also, by default there is limited information available about the selectivity of functions. However, if you create an expression index that uses a function call, useful statistics will be gathered about the function, which can greatly improve query plans that use the expression index.
The volume of statistics gathered depends on the general setting of `default_statistics_target`, which can be overruled with a per-column setting. The setting for a table column automatically covers depending indexes.

The default setting of `100` is conservative. For your test with 1M rows, if the data distribution is uneven, it may help to increase it substantially. Checking on this once more, I found you can actually tweak the statistics target per index column with `ALTER INDEX`, which is currently not documented. See the related discussion on pgsql-docs.
```sql
ALTER INDEX idx_test_btree ALTER COLUMN int4 SET STATISTICS 2000;  -- max 10000, default 100
```
Default names for index columns are not exactly intuitive, but you can look them up with:

```sql
SELECT attname FROM pg_attribute WHERE attrelid = 'idx_test_btree'::regclass;
```

This should return the type name `int4` as index column name in your case.
The best setting for `STATISTICS` depends on several factors: data distribution, data type, update frequency, characteristics of typical queries, ...

Internally, this sets the value of `pg_attribute.attstattarget`, and the exact meaning of this is (per documentation):
> For scalar data types, `attstattarget` is both the target number of "most common values" to collect, and the target number of histogram bins to create.
Then run `ANALYZE` if you don't want to wait for autovacuum to kick in:

```sql
ANALYZE test_data;
```

You must `ANALYZE` the table, since you cannot `ANALYZE` indexes directly. Check with (before and after, if you want to verify the effect):

```sql
SELECT * FROM pg_statistic WHERE starelid = 'idx_test_btree'::regclass;
```
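The raw `pg_statistic` rows are not very readable. The `pg_stats` view presents the same information in human-friendly form, so a check like this (using the index name from your example) may be more convenient:

```sql
SELECT attname, n_distinct, most_common_vals, histogram_bounds
FROM   pg_stats
WHERE  tablename = 'idx_test_btree';
```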
Try your query again ...
The way I read this question, you only care about `message`. The difficulty here is that you need to:

- map over a json array, returning the `message` element
- reduce/fold the array of `message` strings into one aggregate string
This is easy in functional programming. It's not as easy with the stock functions in PostgreSQL, and it'd be difficult to make it work with a declarative language. Maybe one day you'll have a `jsonb_array_elements(jsonb [, path])` which will get you by, but until then we can create a function in our database.
Creating a function with plpgsql

Note this probably isn't as fast or as clean as a plv8 function, but in the next revision we'll return a `tsvector`.

Here we use `jsonb_array_elements` to expand the json array, and then aggregate the `'message'` elements back into a string.
```sql
CREATE OR REPLACE FUNCTION jsonb_message_to_string(jsondata jsonb, OUT string text)
AS $func$
BEGIN
  SELECT INTO string
         string_agg(d->>'message', ' ')
  FROM   jsonb_array_elements(jsondata) AS d;
  RETURN;
END;
$func$ LANGUAGE plpgsql
IMMUTABLE;
```
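A quick sanity check of the function, using a minimal made-up jsonb array:

```sql
SELECT jsonb_message_to_string('[{"message": "first"}, {"message": "second"}]'::jsonb);
-- → 'first second'
```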
Creating `tsvector_agg` and improving our function

This function is not yet optimal though, because it returns a string. There is a second difficulty: as of 9.6, PostgreSQL does not yet ship with a `tsvector_agg`; but it's PostgreSQL, so we can make one.
```sql
CREATE AGGREGATE tsvector_agg (tsvector) (
  SFUNC = tsvector_concat,
  STYPE = tsvector
);
```
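A quick way to see the new aggregate at work, on made-up sample values (`tsvector_concat` is the function behind the built-in `||` operator for `tsvector`):

```sql
SELECT tsvector_agg(to_tsvector(t))
FROM   unnest(ARRAY['first message', 'second message']) AS t;
```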
This permits us to return an aggregate `tsvector`, which is faster and retains positional information. Now we can improve our function. Here we create a new `jsonb_message_to_tsvector`.
```sql
CREATE OR REPLACE FUNCTION jsonb_message_to_tsvector(jsondata jsonb, OUT tsv tsvector)
AS $func$
BEGIN
  SELECT INTO tsv
         tsvector_agg(to_tsvector(d->>'message'))
  FROM   jsonb_array_elements(jsondata) AS d;
  RETURN;
END;
$func$ LANGUAGE plpgsql
IMMUTABLE;
```
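Called directly on a small made-up jsonb array:

```sql
SELECT jsonb_message_to_tsvector('[{"message": "first"}, {"message": "second"}]'::jsonb);
```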
Now we can create our index..

```sql
CREATE INDEX ON foo
  USING gin (jsonb_message_to_tsvector(jsondata));
```
And we would query it like so..

```sql
SELECT *
FROM   foo
WHERE  jsonb_message_to_tsvector(jsondata) @@ 'first';
```

(The expression needs to appear in the `WHERE` clause for the GIN index to be used.)
Best Answer

Since `"37"` is a string, not a number, your second query would have to use a `jsonb` string. Note the double quotes around `37`. Also note that this query wouldn't be able to use the index, while the first query can.
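The distinction is easy to demonstrate with plain `jsonb` literals (a minimal sketch, independent of the table in the question):

```sql
SELECT '"37"'::jsonb = '"37"'::jsonb;  -- true: string equals string
SELECT '"37"'::jsonb = '37'::jsonb;    -- false: a jsonb string never equals a jsonb number
```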