Postgresql – PosgtreSQL GIN + BTree ordering

execution-planindexperformancepostgresqlquery-performance

I have a table with simple structure (id, metadata_json, stamp), stamp is a timestamp and has Btree index on it. MetadataJson is a jsonb with GIN index.

The table has 25M rows. I am using PostgreSQL 10.

                 Table "public.metadata"
    Column     |            Type             | Modifiers
---------------+-----------------------------+-----------
 id            | uuid                        | not null
 metadata_json | jsonb                       |
 stamp         | timestamp without time zone |
Indexes:
    "metadata_pkey" PRIMARY KEY, btree (id)
    "metadata_idx" gin (metadata_json)
    "stamp_idx" btree (stamp)

The query I am executing is quite simple:

select * from metadata where metadata_json @> '{"someBool": true}'
         AND stamp >= '01-01-2016' ORDER BY stamp DESC LIMIT 100;

My idea how it should work: I have a btree on stamp, hence it should go by the index in reverse order, then it should test rows on the json restriction (it has selectivity 40%). I would expect it to return in few milliseconds.

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=80598.46..80598.71 rows=100 width=381) (actual time=445064.728..445064.791 rows=100 loops=1)
   ->  Sort  (cost=80598.46..80607.46 rows=3600 width=381) (actual time=445064.724..445064.754 rows=100 loops=1)
         Sort Key: stamp DESC
         Sort Method: top-N heapsort  Memory: 109kB
         ->  Bitmap Heap Scan on metadata  (cost=66591.00..80460.87 rows=3600 width=381) (actual time=2881.164..444283.520 rows=1437024 loops=1)
               Recheck Cond: ((metadata_json @> '{"someBool": true}'::jsonb) AND (stamp >= '2016-01-01 00:00:00'::timestamp without time zone))
               Heap Blocks: exact=882439
               ->  BitmapAnd  (cost=66591.00..66591.00 rows=3600 width=0) (actual time=2599.415..2599.415 rows=0 loops=1)
                     ->  Bitmap Index Scan on metadata_idx  (cost=0.00..260.25 rows=25100 width=0) (actual time=1762.166..1762.166 rows=10041746 loops=1)
                           Index Cond: (metadata_json @> '{"someBool": true}'::jsonb)
                     ->  Bitmap Index Scan on stamp_idx  (cost=0.00..66328.69 rows=3600034 width=0) (actual time=760.136..760.136 rows=3591329 loops=1)
                           Index Cond: (stamp >= '2016-01-01 00:00:00'::timestamp without time zone)
 Planning time: 5.008 ms
 Execution time: 445072.043 ms
(14 rows)

From the plan it seems that the planner has statistics really off, but the table is analyzed, sampling was set to 1000.


Edit: after some searching I found out that postgres does not have statistics for jsonb data type… Can you please hint me how this type of query can be optimized?

Edit 2: if I disable bitmap scan, the query is extremely fast (2 millis). But I do not think its a good solution…

Edit 3: fencing (WITH CTE Statement)

Limit  (cost=1489968.48..1489968.73 rows=100 width=56) (actual time=447543.199..447543.262 rows=100 loops=1)
  CTE t
    ->  Bitmap Heap Scan on metadata  (cost=67228.70..1408830.13 rows=3600034 width=381) (actual time=1045.566..441897.315 rows=3591329 loops=1)
          Recheck Cond: (stamp >= '2016-01-01 00:00:00'::timestamp without time zone)
          Heap Blocks: exact=1229457
          ->  Bitmap Index Scan on stamp_idx  (cost=0.00..66328.69 rows=3600034 width=0) (actual time=663.960..663.960 rows=3591329 loops=1)
                Index Cond: (stamp >= '2016-01-01 00:00:00'::timestamp without time zone)
  ->  Sort  (cost=81138.35..81147.35 rows=3600 width=56) (actual time=447543.197..447543.227 rows=100 loops=1)
        Sort Key: t.stamp DESC
        Sort Method: top-N heapsort  Memory: 109kB
        ->  CTE Scan on t  (cost=0.00..81000.76 rows=3600 width=56) (actual time=1045.577..446935.261 rows=1437024 loops=1)
              Filter: (metadata_json @> '{"someBool": true}'::jsonb)
              Rows Removed by Filter: 2154305
Planning time: 0.169 ms
Execution time: 447692.843 ms

Edit 4: Fencing (Subselect in FROM)

Limit  (cost=1798933.42..1811851.02 rows=100 width=381) (actual time=198282.400..198282.400 rows=0 loops=1)
  ->  Subquery Scan on foo  (cost=1798933.42..2186461.48 rows=3000 width=381) (actual time=198282.397..198282.397 rows=0 loops=1)
        Filter: (foo.metadata_json @> '{"someBool": true}'::jsonb)
        Rows Removed by Filter: 3591329
        ->  Gather Merge  (cost=1798933.42..2148961.13 rows=3000028 width=381) (actual time=184803.964..195869.763 rows=3591329 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Sort  (cost=1797933.40..1801683.43 rows=1500014 width=381) (actual time=184599.426..188532.836 rows=1197110 loops=3)
                    Sort Key: metadata.stamp DESC
                    Sort Method: external merge  Disk: 461368kB
                    ->  Parallel Bitmap Heap Scan on metadata  (cost=67228.70..1382579.88 rows=1500014 width=381) (actual time=1171.006..178501.269 rows=1197110 loops=3)
                          Recheck Cond: (stamp >= '2016-01-01 00:00:00'::timestamp without time zone)
                          Heap Blocks: exact=408005
                          ->  Bitmap Index Scan on stamp_idx  (cost=0.00..66328.69 rows=3600034 width=0) (actual time=728.401..728.401 rows=3591329 loops=1)
                                Index Cond: (stamp >= '2016-01-01 00:00:00'::timestamp without time zone)
Planning time: 6.704 ms
Execution time: 198509.456 ms

Forcibly disabled bitmap scan

set enable_bitmapscan = off;
explain analyze select * from metadata where metadata_json @> '{"someBool": true}' AND stamp >= '01-01-2015' ORDER BY stamp DESC LIMIT 100;

Limit  (cost=0.44..256064.27 rows=100 width=381) (actual time=0.065..1.814 rows=100 loops=1)
  ->  Index Scan Backward using stamp_idx on metadata  (cost=0.44..18423793.42 rows=7195 width=381) (actual time=0.064..1.777 rows=100 loops=1)
        Index Cond: (stamp >= '2015-01-01 00:00:00'::timestamp without time zone)
        Filter: (metadata_json @> '{"someBool": true}'::jsonb)
        Rows Removed by Filter: 126
Planning time: 0.180 ms
Execution time: 1.856 ms

Edit 5
Only compound gin(stamp, metadata_json) index present:

explain analyze select * from metadata where metadata_json @> '{"someBool": true}
         AND stamp >= '01-01-2016' ORDER BY stamp DESC LIMIT 100;
Limit  (cost=14132.36..14132.61 rows=100 width=381) (actual time=308836.991..308837.052 rows=100 loops=1)
  ->  Sort  (cost=14132.36..14141.36 rows=3600 width=381) (actual time=308836.988..308837.018 rows=100 loops=1)
        Sort Key: stamp DESC
        Sort Method: top-N heapsort  Memory: 109kB
        ->  Bitmap Heap Scan on metadata  (cost=124.90..13994.77 rows=3600 width=381) (actual time=3160.418..308183.328 rows=1437024 loops=1)
              Recheck Cond: ((stamp >= '2016-01-01 00:00:00'::timestamp without time zone) AND (metadata_json @> '{"someBool": true}'::jsonb))
              Heap Blocks: exact=882439
              ->  Bitmap Index Scan on metadata_stamp_metadata_json_idx  (cost=0.00..124.00 rows=3600 width=0) (actual time=2883.484..2883.484 rows=1437024 loops=1)
                    Index Cond: ((stamp >= '2016-01-01 00:00:00'::timestamp without time zone) AND (metadata_json @> '{"someBool": true}'::jsonb))
Planning time: 0.233 ms
Execution time: 308857.051 ms

Final solution:

I have decomposed the json to key-value and stored it as table "recordId, key, value, stamp". And I created a btree index on these – and the result is returned universally in few millis. I do not think there is any good universal solution for json without statistics.

Correct answer goes to Evan, as this is probably the best what can be done in jsonb structure.

Best Answer

The real problem that you're suffering from here is that the stats on jsonb suck. That's a known issue. It won't ever get fixed either.

->  Bitmap Index Scan on metadata_idx  (cost=0.00..260.25 rows=25100 width=0) (actual time=1762.166..1762.166 rows=10041746 loops=1)
    Index Cond: (metadata_json @> '{"someBool": true}'::jsonb)

Here PostgreSQL expects 25100 but the low selectivity is giving back 10041746. The timestamp estimates are pretty accurate returning 3.6 M. PostgreSQL expects to have to dig through that 3.6 M To find a meager 25,100. That's a lot of digging. So instead it does an index scan on the jsonb.

You have a few options.

  • Create compound GIN index as ypercube suggested.

    CREATE EXTENSION btree_gin;
    CREATE INDEX ON metadata USING gin(stamp, metadata_json);
    
  • Add an index on metadata_json->someBool

  • Use an optimization fence.

    SELECT *
    FROM (
      SELECT *
      FROM metadata
      WHERE stamp >= '01-01-2016'
      ORDER BY stamp DESC
      OFFSET 0
    )
    WHERE metadata_json @> '{"someBool": true}'
    ORDER BY stamp DESC
    LIMIT 100;
    

You may also want to look into jsonb_path_ops. It can also be used to create a compound GIN index,

CREATE INDEX ON metadata USING gin(stamp, metadata_json jsonb_path_ops);