While testing a query involving a jsonb column, I noticed that its performance varied significantly between VACUUM ANALYZE runs: I get completely different execution plans, seemingly at random, after analyzing the table.
I'm using Postgres 9.6. The setup for my tests is the following: I'm inserting a single key "x" into the jsonb column "params", with values between 1 and 6, where 1 is the rarest value and 6 the most common. I also have a regular int column "single_param" that contains the same distribution of values, for comparison:
CREATE TABLE test_data (
id serial,
single_param int,
params jsonb
);
INSERT INTO test_data
SELECT
generate_series(1, 1000000) AS id,
floor(log(random() * 9999999 + 1)) AS single_param,
json_build_object(
'x', floor(log(random() * 9999999 + 1))
) AS params;
CREATE INDEX idx_test_btree ON test_data (cast(test_data.params->>'x' AS int));
CREATE INDEX idx_test_gin ON test_data USING GIN (params);
CREATE INDEX ON test_data(id);
CREATE INDEX ON test_data(single_param);
The query I'm testing is a typical pagination query: I'm sorting by id and limiting the output to the first 50 rows.
SELECT * FROM test_data where (params->>'x')::int = 1 ORDER BY id DESC LIMIT 50;
I randomly get one of the following two EXPLAIN ANALYZE outputs after running VACUUM ANALYZE:
Limit (cost=0.42..836.59 rows=50 width=33) (actual time=39.679..410.292 rows=10 loops=1)
-> Index Scan Backward using test_data_id_idx on test_data (cost=0.42..44317.43 rows=2650 width=33) (actual time=39.678..410.283 rows=10 loops=1)
Filter: (((params ->> 'x'::text))::integer = 1)
Rows Removed by Filter: 999990
Planning time: 0.106 ms
Execution time: 410.314 ms
or
Limit (cost=8.45..8.46 rows=1 width=33) (actual time=0.032..0.034 rows=10 loops=1)
-> Sort (cost=8.45..8.46 rows=1 width=33) (actual time=0.032..0.032 rows=10 loops=1)
Sort Key: id DESC
Sort Method: quicksort Memory: 25kB
-> Index Scan using idx_test_btree on test_data (cost=0.42..8.44 rows=1 width=33) (actual time=0.007..0.016 rows=10 loops=1)
Index Cond: (((params ->> 'x'::text))::integer = 1)
Planning time: 0.320 ms
Execution time: 0.052 ms
The difference is that the estimate for the number of rows matching the WHERE clause differs between the two plans: in the first one the estimate is 2650 rows, in the second one 1 row, while the real number is 10 rows.
The following version of the query, which can potentially use the GIN index, does seem to use a default estimate of 1% for the jsonb column, which also results in a bad query plan like the first one above:
SELECT * FROM test_data where params @> '{"x": 1}' ORDER BY id DESC LIMIT 50;
My original assumption was that Postgres would not have any statistics on the jsonb column and would always use a default estimate, like it does for the query using the @> operator. But for the query written so that it can use the btree index I created, it uses varying estimates. Sometimes those are good enough, and sometimes they're bad.
Where do those estimates come from? I assume they're some kind of statistic Postgres creates along with the index. For column statistics there is an option to collect more accurate statistics; is there anything like that for these statistics? Or any other way to get Postgres to pick the better plan in my case?
Best Answer
Currently (version 9.6), Postgres does not have any statistics about the internals of document types like json, jsonb, xml or hstore. (There has been discussion whether and how to change that.) Instead, the query planner uses constant default frequency estimates, like the 1 % you observed for the @> query.
However, there are separate statistics for expression (functional) indexes like your idx_test_btree: the manual has a tip on this, noting that ANALYZE gathers statistics for the indexed expression just as it would for a plain table column. That is where the varying estimates for your btree-friendly query come from.
The volume of statistics gathered depends on the general setting of default_statistics_target, which can be overruled with a per-column setting; a setting on a table column automatically covers depending indexes. The default setting of 100 is conservative. For your test with 1M rows and a skewed distribution, it may help to increase it substantially.
Checking on this once more, I found you can actually tweak the statistics target per index column with ALTER INDEX, which is currently not documented (see the related discussion on pgsql-docs). Default names for index columns are not exactly intuitive, but you can look them up in the pg_attribute catalog.
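A query along these lines does the lookup (a sketch against the pg_attribute system catalog, using the index name from the setup above; pg_attribute has one row per index column):
SELECT attrelid::regclass, attnum, attname
FROM   pg_attribute
WHERE  attrelid = 'idx_test_btree'::regclass
ORDER  BY attnum;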
That should result in the type name int4 as the index column name in your case: the indexed expression cast(params->>'x' AS int) has no name of its own, so Postgres derives one from its type.
The best setting for STATISTICS depends on several factors: data distribution, data type, update frequency, characteristics of typical queries, ... Internally, this sets the value of pg_attribute.attstattarget
, which per the documentation controls the level of detail of statistics accumulated by ANALYZE for that column: zero disables collection, and a negative value means fall back to the system default statistics target.
Then run ANALYZE if you don't want to wait for autovacuum to kick in. You must ANALYZE the table, since you cannot ANALYZE indexes directly. Check pg_stats for the index (before and after, if you want to verify the effect), then try your query again ...
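Put together, the steps above look like this (a sketch: the statistics target of 2000 is an arbitrary illustrative value, and the column name int4 should be verified with the catalog lookup first):
-- raise the statistics target for the expression index column (undocumented in 9.6)
ALTER INDEX idx_test_btree ALTER COLUMN int4 SET STATISTICS 2000;
-- refresh statistics; the table, not the index, is analyzed
ANALYZE test_data;
-- expression-index statistics show up in pg_stats under the index name
SELECT * FROM pg_stats WHERE tablename = 'idx_test_btree';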