PostgreSQL – Query Optimization for Finding Suitable Work from List

Tags: json | performance | postgresql | query-performance | rdbms

I have a table with a jsonb column, defined as follows:

CREATE TABLE work
(
    id SERIAL NOT NULL,
    work_data JSONB
);

Sample data (id, work_data) looks like this:

100 {"work_id": [7245, 3991, 3358, 1028]}
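For reference, the sample row above could be loaded like this (the explicit id simply overrides the SERIAL default for illustration):

```sql
INSERT INTO work (id, work_data)
VALUES (100, '{"work_id": [7245, 3991, 3358, 1028]}');
```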

I created a GIN index for work_id as follows:

CREATE INDEX idzworkdata ON work USING gin ((work_data -> 'work_id'));

The Postgres documentation says a GIN index works with the @> containment operator.
But I need to find all the work records whose work_ids are contained in the set a user inputs, which requires the <@ operator.
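The difference between the two directions of containment can be checked directly:

```sql
-- @> : the left operand contains the right operand
--      (supported by the default jsonb GIN operator class)
SELECT '[7245, 3991, 3358, 1028]'::jsonb @> '[7245, 3991]';   -- true

-- <@ : the left operand is contained in the right operand
--      (NOT supported by that operator class)
SELECT '[7245, 3991]'::jsonb <@ '[7245, 3991, 3358, 1028]';   -- true
```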

Link to postgres documentation:
https://www.postgresql.org/docs/current/datatype-json.html

Section 8.14.4

"The default GIN operator class for jsonb supports queries with the
@>, ?, ?& and ?| operators. (For details of the semantics that these
operators implement, see Table 9-41.) An example of creating an index
with this operator class is"

When I execute the following query:

SELECT *
FROM public.work
WHERE work_json -> 'skill' <@ '[3587, 3422, 7250, 458]';

Execution Plan:

Gather  (cost=1000.00..246319.01 rows=10000 width=114) (actual time=0.568..2647.415 rows=1 loops=1)                          
  Workers Planned: 2                                                                                                         
  Workers Launched: 2                                                                                                        
  ->  Parallel Seq Scan on work  (cost=0.00..244319.01 rows=4167 width=114) (actual time=1746.766..2627.820 rows=0 loops=3)  
        Filter: ((work_json -> 'skill'::text) <@ '[3587, 3422, 7250, 458]'::jsonb)                                           
        Rows Removed by Filter: 3333333                                                                                      
Planning Time: 1.456 ms                                                                                                      
Execution Time: 2647.470 ms

The query does not use the GIN index. Is there any workaround that would let the <@ operator use the GIN index?

Update 2:

A non-Postgres-specific approach I tried (the query takes around 40 to 50 seconds, which is huge):

I used two tables:

CREATE TABLE public.work
(
    id integer NOT NULL DEFAULT nextval('work_id_seq'::regclass),
    work_data_id integer[],
    work_json jsonb
);

CREATE TABLE public.work_data
(
    work_data_id bigint,
    work_id bigint
);

Query:

select work.id 
from work  
   inner join work_data on (work.id=work_data.work_id) 
group by work.id 
having sum(case when work_data.work_data_id in (2269,3805,828,9127) then 0 else 1 end)=0;

Execution Plan:
Finalize GroupAggregate  (cost=3618094.30..6459924.90 rows=50000 width=4) (actual time=41891.301..64750.815 rows=1 loops=1)                                      
  Group Key: work.id                                                                                                                                             
  Filter: (sum(CASE WHEN (work_data.work_data_id = ANY ('{2269,3805,828,9127}'::bigint[])) THEN 0 ELSE 1 END) = 0)                                               
  Rows Removed by Filter: 9999999                                                                                                                                
  ->  Gather Merge  (cost=3618094.30..6234924.88 rows=20000002 width=12) (actual time=41891.217..58887.351 rows=10000581 loops=1)                                
        Workers Planned: 2                                                                                                                                       
        Workers Launched: 2                                                                                                                                      
        ->  Partial GroupAggregate  (cost=3617094.28..3925428.38 rows=10000001 width=12) (actual time=41792.169..53183.859 rows=3333527 loops=3)                 
              Group Key: work.id                                                                                                                                 
              ->  Sort  (cost=3617094.28..3658761.10 rows=16666727 width=12) (actual time=41792.125..45907.253 rows=13333333 loops=3)                            
                    Sort Key: work.id                                                                                                                            
                    Sort Method: external merge  Disk: 339000kB                                                                                                  
                    Worker 0:  Sort Method: external merge  Disk: 338992kB                                                                                       
                    Worker 1:  Sort Method: external merge  Disk: 339784kB                                                                                       
                    ->  Parallel Hash Join  (cost=291846.01..1048214.42 rows=16666727 width=12) (actual time=13844.982..23748.244 rows=13333333 loops=3)         
                          Hash Cond: (work_data.work_id = work.id)                                                                                               
                          ->  Parallel Seq Scan on work_data  (cost=0.00..382884.27 rows=16666727 width=16) (actual time=0.020..4094.341 rows=13333333 loops=3)  
                          ->  Parallel Hash  (cost=223485.67..223485.67 rows=4166667 width=4) (actual time=3345.351..3345.351 rows=3333334 loops=3)              
                                Buckets: 131072  Batches: 256  Memory Usage: 2592kB                                                                              
                                ->  Parallel Seq Scan on work  (cost=0.00..223485.67 rows=4166667 width=4) (actual time=0.182..1603.437 rows=3333334 loops=3)    
Planning Time: 1.544 ms                                                                                                                                          
Execution Time: 65503.341 ms 

NOTE: A little background: the work table holds the details of each work item and the respective work ids needed to perform it. Each user can perform a certain set of work ids, which is always a superset of any single work's work ids.
I tried normal join queries with the work table and the work-id list as separate tables, but the query does a full table scan and takes around 40 seconds, which is huge.

Best Answer

You could use a helper function that converts a jsonb array to an integer array:

CREATE FUNCTION jsonarr2intarr(text) RETURNS int[]
   LANGUAGE sql IMMUTABLE AS
$$SELECT translate($1, '[]', '{}')::int[]$$;
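For example (note that this string-translation trick assumes the value is a flat JSON array of numbers; nested arrays or non-numeric elements would make the cast to int[] fail):

```sql
SELECT jsonarr2intarr('[7245, 3991, 3358, 1028]');
-- → {7245,3991,3358,1028}
```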

This can be used with an index (the default GIN operator class for integer arrays supports <@, unlike the jsonb one):

CREATE INDEX ON work USING gin (jsonarr2intarr(work_data ->> 'work_id'));

A modified query can make use of that index:

EXPLAIN (COSTS OFF)
SELECT * FROM work
WHERE jsonarr2intarr(work_data ->> 'work_id')
      <@ ARRAY[1,2,3,5,6,11,7245,3991,3358,1028];

                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on work
   Recheck Cond: (jsonarr2intarr((work_data ->> 'work_id'::text)) <@ '{1,2,3,5,6,11,7245,3991,3358,1028}'::integer[])
   ->  Bitmap Index Scan on work_jsonarr2intarr_idx
         Index Cond: (jsonarr2intarr((work_data ->> 'work_id'::text)) <@ '{1,2,3,5,6,11,7245,3991,3358,1028}'::integer[])
(4 rows)