Postgresql – Query results not as expected – Jsonb Postgres

btreeindexjsonoptimizationpostgresql

We've created the following table:

CREATE TABLE public.influencers
(
    id integer NOT NULL DEFAULT nextval('influencers_id_seq'::regclass),
    location jsonb,
    gender text COLLATE pg_catalog."default",
    birthdate timestamp without time zone,
    ig jsonb,
    contact_info jsonb,
    created_at timestamp without time zone DEFAULT now(),
    updated_at timestamp without time zone DEFAULT now(),
    categories text[] COLLATE pg_catalog."default",
    search_field text COLLATE pg_catalog."default",
    search_vector tsvector,
    ig_updated_at timestamp without time zone,
    CONSTRAINT influencers_pkey PRIMARY KEY (id),
    CONSTRAINT ig_id_must_exist CHECK (ig ? 'id'::text),
    CONSTRAINT ig_username_must_exist CHECK (ig ? 'username'::text)
)

And created the following index in order to filter a field inside ig that contains integers. This is the index we created:

CREATE INDEX idx_btree_ig_follower_count
  ON public.influencers
  USING BTREE ((ig->>'follower_count'));

The following query should return 427000 results. Instead, it returns a random number that nothing has to do with what it should return. This is the query:

SELECT count(*)
FROM "influencers"
WHERE ((ig ->> 'follower_count') >= '1000')
  AND ((ig ->> 'follower_count') <= '10000')
  AND (ig->>'follower_count') IS NOT NULL

We suspect that it is comparing strings (or text) instead of integers and that's why it's returning incomprehensible results.

However, the query that works (return expected rows) but doesn't use the idx_btree_ig_follower_count index is the following:

SELECT count(*)
FROM "influencers"
WHERE ((ig -> 'follower_count') >= '1000')
  AND ((ig -> 'follower_count') <= '10000')
  AND (ig->'follower_count') IS NOT NULL;

How can we create the query or the index in order to being able to query efficiently?

Best Answer

Try this.. Just out of curiosity..

CREATE INDEX idx_btree_ig_follower_count
  ON public.influencers
  USING BTREE (CAST(ig->>'follower_count' AS int));

Then in your query, cast to int.

SELECT count(*)
FROM "influencers"
WHERE ((ig ->> 'follower_count')::int >= '1000')
  AND ((ig ->> 'follower_count')::int <= '10000');

You don't need to specify IS NOT NULL if you specify >=, and <=

Test case

I created some test data with 10 million rows of jsonb, with btree index over it.

CREATE TABLE influences
AS
  SELECT
    jsonb_build_object('follower_count', trunc(random()*1e7)::int) AS ig 
  FROM generate_series(1,1e7);

CREATE INDEX ON influences( CAST( (ig->>'follower_count') AS int) );

I'm getting the same query plan you provided,. You can see mine here.

EXPLAIN (ANALYZE,BUFFERS)
SELECT count(*) FROM influences
WHERE (ig->>'follower_count')::int BETWEEN 0 AND 854780;
                                                                     QUERY PLAN                                                                     
----------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=130124.36..130124.37 rows=1 width=0) (actual time=482.615..482.615 rows=1 loops=1)
   Buffers: shared hit=85674
   ->  Bitmap Heap Scan on influences  (cost=18483.87..127946.94 rows=870969 width=0) (actual time=206.182..426.880 rows=855570 loops=1)
         Recheck Cond: ((((ig ->> 'follower_count'::text))::integer >= 0) AND (((ig ->> 'follower_count'::text))::integer <= 854780))
         Heap Blocks: exact=83334
         Buffers: shared hit=85674
         ->  Bitmap Index Scan on influences_int4_idx  (cost=0.00..18266.13 rows=870969 width=0) (actual time=182.158..182.158 rows=855570 loops=1)
               Index Cond: ((((ig ->> 'follower_count'::text))::integer >= 0) AND (((ig ->> 'follower_count'::text))::integer <= 854780))
               Buffers: shared hit=2340
 Planning time: 0.341 ms
 Execution time: 482.665 ms
(11 rows)

I'm aggregating far more rows, 855570 in my example, and it's over 10x faster.

I clustered it

CLUSTER influences USING influences_int4_idx ;

And, that halfed the time further to 200ms

EXPLAIN (ANALYZE,BUFFERS) SELECT count(*) FROM influences WHERE (ig->>'follower_count')::int BETWEEN 0 AND 854780;
                                                                    QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=130124.66..130124.67 rows=1 width=0) (actual time=211.880..211.881 rows=1 loops=1)
   Buffers: shared hit=4 read=9466
   ->  Bitmap Heap Scan on influences  (cost=18483.94..127947.22 rows=870976 width=0) (actual time=69.503..155.670 rows=855570 loops=1)
         Recheck Cond: ((((ig ->> 'follower_count'::text))::integer >= 0) AND (((ig ->> 'follower_count'::text))::integer <= 854780))
         Heap Blocks: exact=7130
         Buffers: shared hit=4 read=9466
         ->  Bitmap Index Scan on influences_int4_idx  (cost=0.00..18266.20 rows=870976 width=0) (actual time=68.150..68.150 rows=855570 loops=1)
               Index Cond: ((((ig ->> 'follower_count'::text))::integer >= 0) AND (((ig ->> 'follower_count'::text))::integer <= 854780))
               Buffers: shared hit=3 read=2337
 Planning time: 0.412 ms
 Execution time: 211.929 ms
(11 rows)