Currently (version 9.6), Postgres does not have any statistics about the internals of document types like `json`, `jsonb`, `xml` or `hstore`. (There has been discussion whether and how to change that.) Instead, the Postgres query planner uses constant default frequency estimates (like you observed).

However, there are separate statistics for functional indexes like your `idx_test_btree`. The manual has this tip for you:
> *Tip:* Although per-column tweaking of `ANALYZE` frequency might not be very productive, you might find it worthwhile to do per-column adjustment of the level of detail of the statistics collected by `ANALYZE`. Columns that are heavily used in `WHERE` clauses and have highly irregular data distributions might require a finer-grain data histogram than other columns. See `ALTER TABLE SET STATISTICS`, or change the database-wide default using the `default_statistics_target` configuration parameter.
>
> Also, by default there is limited information available about the selectivity of functions. However, if you create an expression index that uses a function call, useful statistics will be gathered about the function, which can greatly improve query plans that use the expression index.
The volume of statistics gathered depends on the general setting of `default_statistics_target`, which can be overruled with a per-column setting. The setting for a table column automatically covers depending indexes.

The default setting of `100` is conservative. For your test with 1M rows, if the data distribution is uneven, it may help to increase it substantially. Checking on this once more, I found you can actually tweak the statistics target per index column with `ALTER INDEX`, which is currently not documented. See the related discussion on pgsql-docs.
```sql
ALTER INDEX idx_test_btree ALTER COLUMN int4 SET STATISTICS 2000;  -- max 10000, default 100
```
Default names for index columns are not exactly intuitive, but you can look them up with:

```sql
SELECT attname FROM pg_attribute WHERE attrelid = 'idx_test_btree'::regclass;
```

This should return the type name `int4` as index column name in your case.
The best setting for `STATISTICS` depends on several factors: data distribution, data type, update frequency, characteristics of typical queries, ...

Internally, this sets the value of `pg_attribute.attstattarget`, and the exact meaning of this is (per documentation):
> For scalar data types, `attstattarget` is both the target number of "most common values" to collect, and the target number of histogram bins to create.
Then run `ANALYZE` if you don't want to wait for autovacuum to kick in:

```sql
ANALYZE test_data;
```

You must `ANALYZE` the table, since you cannot `ANALYZE` indexes directly. Check with (before and after, if you want to verify the effect):

```sql
SELECT * FROM pg_statistic WHERE starelid = 'idx_test_btree'::regclass;
```
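The raw `pg_statistic` rows are not very readable. The `pg_stats` view presents the same information in human-friendly form, so a check like this (using the index name from your example) may be more convenient:

```sql
SELECT attname, n_distinct, most_common_vals, histogram_bounds
FROM   pg_stats
WHERE  tablename = 'idx_test_btree';
```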
Try your query again ...
The way I read this question, you only care about `message`. The difficulty here is that you need to:

- map over a json array, returning the `message` element
- reduce/fold the array of `message` strings into one aggregate string
This is easy in functional programming. It's not as easy with the stock functions in PostgreSQL, and it'd be difficult to make it work with a declarative language. Maybe one day you'll have a `jsonb_array_elements(jsonb [, path])` which will get you by, but until then we can create a function in our database.
Creating a function with plpgsql

Note this probably isn't as fast or as clean as a plv8 function, but in the next revision we'll return a `tsvector`.

Here we use `jsonb_array_elements` to expand the json array, and then aggregate the `'message'` elements back into a string.
```sql
CREATE OR REPLACE FUNCTION jsonb_message_to_string(jsondata jsonb, OUT string text)
AS $func$
BEGIN
  SELECT INTO string
         string_agg(d->>'message', ' ')
  FROM   jsonb_array_elements(jsondata) AS d;
  RETURN;
END;
$func$ LANGUAGE plpgsql
IMMUTABLE;
```
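A quick sanity check of the function, using a minimal made-up jsonb array:

```sql
SELECT jsonb_message_to_string('[{"message": "first"}, {"message": "second"}]'::jsonb);
-- → 'first second'
```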
Creating `tsvector_agg` and improving our function

This function is not yet optimal though, because it returns a string. There is a second difficulty: as of 9.6, PostgreSQL does not yet ship with a `tsvector_agg`; but it's PostgreSQL, so we can make one.
```sql
CREATE AGGREGATE tsvector_agg (tsvector) (
  SFUNC = tsvector_concat,
  STYPE = tsvector
);
```
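A quick way to see the new aggregate at work, on made-up sample values (`tsvector_concat` is the function behind the built-in `||` operator for `tsvector`):

```sql
SELECT tsvector_agg(to_tsvector(t))
FROM   unnest(ARRAY['first message', 'second message']) AS t;
```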
This permits us to return an aggregate `tsvector`, which is faster and retains positional information. Now we can improve our function. Here we create a new `jsonb_message_to_tsvector`.
```sql
CREATE OR REPLACE FUNCTION jsonb_message_to_tsvector(jsondata jsonb, OUT tsv tsvector)
AS $func$
BEGIN
  SELECT INTO tsv
         tsvector_agg(to_tsvector(d->>'message'))
  FROM   jsonb_array_elements(jsondata) AS d;
  RETURN;
END;
$func$ LANGUAGE plpgsql
IMMUTABLE;
```
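Called directly on a small made-up jsonb array:

```sql
SELECT jsonb_message_to_tsvector('[{"message": "first"}, {"message": "second"}]'::jsonb);
```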
Now we can create our index..

```sql
CREATE INDEX ON foo
  USING gin (jsonb_message_to_tsvector(jsondata));
```
And we would query it like so..

```sql
SELECT *
FROM   foo
WHERE  jsonb_message_to_tsvector(jsondata) @@ 'first';
```

(The expression needs to appear in the `WHERE` clause for the GIN index to be used.)
Best Answer

Since `"37"` is a string, not a number, your second query would have to use a `jsonb` string. Note the double quotes around `37`. Also note that this query wouldn't be able to use the index, while the first query can.
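The distinction is easy to demonstrate with plain `jsonb` literals (a minimal sketch, independent of the table in the question):

```sql
SELECT '"37"'::jsonb = '"37"'::jsonb;  -- true: string equals string
SELECT '"37"'::jsonb = '37'::jsonb;    -- false: a jsonb string never equals a jsonb number
```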