PostgreSQL – Improve Query Performance When Joining by jsonb_exists

postgresql

Our application recently changed how some of its data is modeled: the schema gained a jsonb array column, populated from data in another column.

We're finding that the new queries joining on this column are unacceptably slow against our full data set, and we're in a tight spot because we're forced to keep using the old queries until this is resolved.

The main table we're querying on is defined like so:

CREATE TABLE public.visit (
  visit_key CHARACTER VARYING PRIMARY KEY NOT NULL,
  store_key CHARACTER VARYING NOT NULL,
  launch_key CHARACTER VARYING NOT NULL,
  when_ TIMESTAMP WITHOUT TIME ZONE NOT NULL,
  page CHARACTER VARYING NOT NULL,
  ip CHARACTER VARYING NOT NULL,
  useragent CHARACTER VARYING,
  referrer CHARACTER VARYING,
  visitor CHARACTER VARYING NOT NULL,
  country CHARACTER VARYING,
  region CHARACTER VARYING,
  city CHARACTER VARYING,
  vars JSONB,
  stops JSONB,
  FOREIGN KEY (launch_key) REFERENCES public.launch (launch_key)
  MATCH SIMPLE ON UPDATE NO ACTION ON DELETE NO ACTION,
  FOREIGN KEY (store_key) REFERENCES public.store (store_key)
  MATCH SIMPLE ON UPDATE NO ACTION ON DELETE NO ACTION
);
CREATE INDEX visit_store_key_idx ON visit USING BTREE (store_key);
CREATE INDEX visit_launch_key_idx ON visit USING BTREE (launch_key);
CREATE INDEX visit_visitor_idx ON visit USING BTREE (visitor);
CREATE INDEX visit_stops_idx ON visit USING GIN (stops);
CREATE INDEX visit_when__idx ON visit USING BTREE (when_);

Our old "fast" query and corresponding plan looks like this:

EXPLAIN (ANALYZE, VERBOSE)
SELECT
    date_trunc('day', "public"."visit"."when_" :: TIMESTAMP WITH TIME ZONE AT TIME ZONE 'America/Los_Angeles'),
    count("public"."visit"."visitor"),
    count(DISTINCT "public"."visit"."visitor")
FROM "public"."visit" join "public"."launch" ON ("public"."visit"."launch_key" = "public"."launch"."launch_key")
WHERE (
    "public"."visit"."store_key" = 'ahBzfmdlYXJsYXVuY2gtaHVicg8LEgVTdG9yZRi53Ku8Cgw':: VARCHAR AND
    "public"."visit"."when_" >= CAST ('2016-11-03' AS TIMESTAMP ):: TIMESTAMP )
GROUP BY date_trunc('day', "public"."visit"."when_":: TIMESTAMP WITH TIME ZONE AT TIME ZONE 'America/Los_Angeles')
ORDER BY date_trunc('day', "public"."visit"."when_":: TIMESTAMP WITH TIME ZONE AT TIME ZONE 'America/Los_Angeles') ASC;

GroupAggregate  (cost=14018.92..14022.71 rows=138 width=45) (actual time=102.605..102.634 rows=6 loops=1)
  Output: (date_trunc('day'::text, timezone('America/Los_Angeles'::text, (visit.when_)::timestamp with time zone))), count(visit.visitor), count(DISTINCT visit.visitor)
  Group Key: (date_trunc('day'::text, timezone('America/Los_Angeles'::text, (visit.when_)::timestamp with time zone)))
  ->  Sort  (cost=14018.92..14019.26 rows=138 width=45) (actual time=102.568..102.580 rows=38 loops=1)
        Output: (date_trunc('day'::text, timezone('America/Los_Angeles'::text, (visit.when_)::timestamp with time zone))), visit.visitor
        Sort Key: (date_trunc('day'::text, timezone('America/Los_Angeles'::text, (visit.when_)::timestamp with time zone)))
        Sort Method: quicksort  Memory: 27kB
        ->  Nested Loop  (cost=12529.29..14014.02 rows=138 width=45) (actual time=101.423..102.538 rows=38 loops=1)
              Output: date_trunc('day'::text, timezone('America/Los_Angeles'::text, (visit.when_)::timestamp with time zone)), visit.visitor
              ->  Bitmap Heap Scan on public.visit  (cost=12528.73..13080.60 rows=138 width=117) (actual time=101.263..101.451 rows=38 loops=1)
                    Output: visit.when_, visit.visitor, visit.launch_key
                    Recheck Cond: (((visit.store_key)::text = 'ahBzfmdlYXJsYXVuY2gtaHVicg8LEgVTdG9yZRi53Ku8Cgw'::text) AND (visit.when_ >= '2016-11-03 00:00:00'::timestamp without time zone))
                    Heap Blocks: exact=38
                    ->  BitmapAnd  (cost=12528.73..12528.73 rows=138 width=0) (actual time=101.239..101.239 rows=0 loops=1)
                          ->  Bitmap Index Scan on visit_store_key_idx  (cost=0.00..1762.64 rows=23727 width=0) (actual time=1.034..1.034 rows=4863 loops=1)
                                Index Cond: ((visit.store_key)::text = 'ahBzfmdlYXJsYXVuY2gtaHVicg8LEgVTdG9yZRi53Ku8Cgw'::text)
                          ->  Bitmap Index Scan on visit_when__idx  (cost=0.00..10765.77 rows=326561 width=0) (actual time=99.087..99.087 rows=843286 loops=1)
                                Index Cond: (visit.when_ >= '2016-11-03 00:00:00'::timestamp without time zone)
              ->  Index Only Scan using launch_pkey on public.launch  (cost=0.56..6.75 rows=1 width=72) (actual time=0.026..0.026 rows=1 loops=38)
                    Output: launch.launch_key
                    Index Cond: (launch.launch_key = (visit.launch_key)::text)
                    Heap Fetches: 38
Planning time: 1.225 ms
Execution time: 102.714 ms

The new query joins on the new jsonb stops column (populated from the old visit.launch_key column).

Data was migrated such that if visit.launch_key contained 'abc', visit.stops now contains ['abc'].
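
For illustration, the backfill amounts to roughly the following (a sketch, not the exact migration we ran):

-- Wrap the existing scalar launch_key in a single-element jsonb array.
UPDATE public.visit
SET stops = jsonb_build_array(launch_key);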

I'd expect the GIN index we put on stops to enable this query to perform comparably to the old query, but the performance is really bad:

EXPLAIN (ANALYZE, VERBOSE)
SELECT
    date_trunc('day', "public"."visit"."when_" :: TIMESTAMP WITH TIME ZONE AT TIME ZONE 'America/Los_Angeles'),
    count("public"."visit"."visitor"),
    count(DISTINCT "public"."visit"."visitor")
FROM "public"."visit" join "public"."launch" ON ("public"."visit"."stops" ? "public"."launch"."launch_key")
WHERE (
    "public"."visit"."store_key" = 'ahBzfmdlYXJsYXVuY2gtaHVicg8LEgVTdG9yZRi53Ku8Cgw':: VARCHAR AND
    "public"."visit"."when_" >= CAST ('2016-11-03' AS TIMESTAMP ):: TIMESTAMP )
GROUP BY date_trunc('day', "public"."visit"."when_":: TIMESTAMP WITH TIME ZONE AT TIME ZONE 'America/Los_Angeles')
ORDER BY date_trunc('day', "public"."visit"."when_":: TIMESTAMP WITH TIME ZONE AT TIME ZONE 'America/Los_Angeles') ASC;

GroupAggregate  (cost=7643302.27..7648253.15 rows=128 width=45) (actual time=54946.583..54946.605 rows=5 loops=1)
  Output: (date_trunc('day'::text, timezone('America/Los_Angeles'::text, (visit.when_)::timestamp with time zone))), count(visit.visitor), count(DISTINCT visit.visitor)
  Group Key: (date_trunc('day'::text, timezone('America/Los_Angeles'::text, (visit.when_)::timestamp with time zone)))
  ->  Sort  (cost=7643302.27..7644539.43 rows=494864 width=45) (actual time=54946.535..54946.539 rows=34 loops=1)
        Output: (date_trunc('day'::text, timezone('America/Los_Angeles'::text, (visit.when_)::timestamp with time zone))), visit.visitor
        Sort Key: (date_trunc('day'::text, timezone('America/Los_Angeles'::text, (visit.when_)::timestamp with time zone)))
        Sort Method: quicksort  Memory: 27kB
        ->  Nested Loop  (cost=11758.37..7596496.37 rows=494864 width=45) (actual time=54648.912..54946.507 rows=34 loops=1)
              Output: date_trunc('day'::text, timezone('America/Los_Angeles'::text, (visit.when_)::timestamp with time zone)), visit.visitor
              Join Filter: (visit.stops ? (launch.launch_key)::text)
              Rows Removed by Join Filter: 134025620
              ->  Seq Scan on public.launch  (cost=0.00..157554.25 rows=3866125 width=72) (actual time=0.015..1381.991 rows=3941931 loops=1)
                    Output: launch.launch_key
              ->  Materialize  (cost=11758.37..12270.96 rows=128 width=126) (actual time=0.000..0.005 rows=34 loops=3941931)
                    Output: visit.when_, visit.visitor, visit.stops
                    ->  Bitmap Heap Scan on public.visit  (cost=11758.37..12270.32 rows=128 width=126) (actual time=135.903..136.041 rows=34 loops=1)
                          Output: visit.when_, visit.visitor, visit.stops
                          Recheck Cond: (((visit.store_key)::text = 'ahBzfmdlYXJsYXVuY2gtaHVicg8LEgVTdG9yZRi53Ku8Cgw'::text) AND (visit.when_ >= '2016-11-03 00:00:00'::timestamp without time zone))
                          Heap Blocks: exact=34
                          ->  BitmapAnd  (cost=11758.37..11758.37 rows=128 width=0) (actual time=135.879..135.879 rows=0 loops=1)
                                ->  Bitmap Index Scan on visit_store_key_idx  (cost=0.00..1762.64 rows=23727 width=0) (actual time=64.278..64.278 rows=4859 loops=1)
                                      Index Cond: ((visit.store_key)::text = 'ahBzfmdlYXJsYXVuY2gtaHVicg8LEgVTdG9yZRi53Ku8Cgw'::text)
                                ->  Bitmap Index Scan on visit_when__idx  (cost=0.00..9995.41 rows=303313 width=0) (actual time=70.842..70.842 rows=716638 loops=1)
                                      Index Cond: (visit.when_ >= '2016-11-03 00:00:00'::timestamp without time zone)
Planning time: 0.285 ms
Execution time: 54946.700 ms

I don't understand why the sort is so much cheaper in the old query's plan than in the new one's. Is there anything that can be done to make the query that joins on launch via the jsonb stops column perform reasonably?

Postgres version:

$ psql --version
psql (PostgreSQL) 9.5.4

Best Answer

Data was migrated such that if visit.launch_key contained 'abc', visit.stops now contains ['abc'].

This seems like a really bad idea. Why do you want to do that? Is it too late to change your mind?

PostgreSQL can get a pretty good estimate of how selective visit.store_key = parameter is. It has much less accurate information about how selective visit.stops ? parameter is.
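
You can see the difference in what the planner has to work with by looking at pg_stats (a rough illustration only): for store_key it collects most-common values and a histogram, while for the jsonb stops column there are no statistics describing which launch keys appear inside the arrays, so ? essentially falls back to a generic selectivity guess.

-- Illustration: per-column statistics available to the planner.
SELECT attname, n_distinct, null_frac,
       most_common_vals IS NOT NULL AS has_mcvs
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'visit'
  AND attname IN ('store_key', 'stops');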

When the ? is only used as a filter, this probably isn't going to be so bad. But when it is used as a join operator, this leads the planner to make very poor choices.
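
One thing worth trying, sketched here and untested against your data, is to unnest the array so the join becomes an ordinary equality the planner can estimate, e.g. with jsonb_array_elements_text (note this can change the counts slightly if a visit ever has more than one matching stop):

-- Sketch: expand stops into its elements and join on plain equality,
-- so the planner sees an equi-join against launch_pkey instead of "stops ? key".
SELECT date_trunc('day', v.when_::timestamp with time zone AT TIME ZONE 'America/Los_Angeles'),
       count(v.visitor),
       count(DISTINCT v.visitor)
FROM public.visit v
CROSS JOIN LATERAL jsonb_array_elements_text(v.stops) AS s(launch_key)
JOIN public.launch l ON l.launch_key = s.launch_key
WHERE v.store_key = 'ahBzfmdlYXJsYXVuY2gtaHVicg8LEgVTdG9yZRi53Ku8Cgw'
  AND v.when_ >= '2016-11-03'::timestamp
GROUP BY 1
ORDER BY 1;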

This is an acknowledged shortcoming, and there has been some discussion of improving the situation in a future version of PostgreSQL, but so far I don't think it has moved beyond discussion, so I wouldn't be terribly optimistic. But even if it were fixed, I would still expect a scalar stored as a scalar to work better than a scalar dressed up as an array.
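
And if stops really does always hold exactly one element, as the migration you describe implies, the lone element can be compared directly, which makes the join an ordinary equality again. Purely as an illustration (the index is only needed if you also want to find visits by launch key):

-- Hypothetical sketch, valid only while stops always holds exactly one element:
--   FROM public.visit JOIN public.launch
--     ON (public.visit.stops->>0) = public.launch.launch_key
-- An expression index helps only if you also need to look up visits by launch key:
CREATE INDEX visit_stops_first_idx ON public.visit ((stops->>0));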

By the way, you asked why it slows down your sort. It doesn't. The time reported for the Sort node includes the time needed to generate the data being sorted; in the slow plan, the Sort reports actual time=54946.535, but its child Nested Loop already accounts for 54946.507 of that. The sort itself is very quick.