Postgresql – Optimizing “WHERE x BETWEEN a AND b GROUP BY y” query

Tags: index, optimization, performance, postgresql, postgresql-9.4

CREATE TABLE test_table
(
  id uuid NOT NULL,
  "RefId" uuid NOT NULL,
  "timestampCol" timestamp without time zone NOT NULL,
  "bigint1" bigint NOT NULL,
  "bigint2" bigint NOT NULL,
  "int1" integer NOT NULL,
  "int2" integer NOT NULL,
  "bigint3" bigint NOT NULL,
  "bigint4" bigint NOT NULL,
  "bigint5" bigint NOT NULL,
  "hugeText" text NOT NULL,
  "bigint6" bigint NOT NULL,
  "bigint7" bigint NOT NULL,
  "bigint8" bigint NOT NULL,
  "denormalizedData" jsonb NOT NULL,
  "textCol" text NOT NULL,
  "smallText" text NOT NULL,
  "createdAt" timestamp with time zone NOT NULL,
  "updatedAt" timestamp with time zone NOT NULL,
  CONSTRAINT test_pkey PRIMARY KEY (id)
);

SELECT "textCol", SUM("bigint1"), SUM("bigint2") -- etc, almost every single column gets aggregated
FROM "test_table"
WHERE "timestampCol" BETWEEN '2016-06-12' AND '2016-06-17'
GROUP BY "textCol"
ORDER BY SUM("bigint2"), SUM("bigint3") DESC -- the ORDER BY columns are dynamic, but there are only 4 possible combinations of columns.
LIMIT 50;

Please correct me where my understanding is incorrect. As I understand it, Postgres can leverage an index on "timestampCol" or one on "textCol", but never both at the same time? The query plans I've pasted are only meant to show which algorithms Postgres picks. The real tables have a few million rows, not just ~66,000.

  1. CREATE INDEX timestamp_col_index on test_table using btree ("timestampCol")
    

    An index (btree) on "timestampCol" means that the query planner will first narrow the dataset to only the rows between '2016-06-12' and '2016-06-17' before using a HashAggregate or a Sort + GroupAggregate to group the rows by "textCol".

    GroupAggregate  (cost=3925.50..4483.19 rows=22259 width=41) (actual time=80.764..125.342 rows=22663 loops=1)
      Group Key: "textCol"
      ->  Sort  (cost=3925.50..3981.45 rows=22380 width=41) (actual time=80.742..84.915 rows=22669 loops=1)
            Sort Key: "textCol"
            Sort Method: quicksort  Memory: 2540kB
            ->  Index Scan using timestamp_col_index on test_table  (cost=0.29..2308.56 rows=22380 width=41) (actual time=0.053..13.939 rows=22669 loops=1)
                  Index Cond: (("timestampCol" >= '2016-06-12 00:00:00'::timestamp without time zone) AND ("timestampCol" <= '2016-06-17 00:00:00'::timestamp without time zone))
    
  2. CREATE INDEX text_col_index on test_table using btree ("textCol")
    

    An index (btree) on textCol means the query planner already has the rows "pre-grouped" but it has to traverse every single row in the index to filter out those that don't match timestampCol BETWEEN timestamp1 AND timestamp2.

    GroupAggregate  (cost=0.42..16753.91 rows=22259 width=41) (actual time=0.281..127.047 rows=22663 loops=1)
      Group Key: "textCol"
      ->  Index Scan using text_col_index on test_table  (cost=0.42..16252.18 rows=22380 width=41) (actual time=0.235..76.182 rows=22669 loops=1)
            Filter: (("timestampCol" >= '2016-06-12 00:00:00'::timestamp without time zone) AND ("timestampCol" <= '2016-06-17 00:00:00'::timestamp without time zone))
            Rows Removed by Filter: 43719
    
  3. Creating both indexes means Postgres will run a cost analysis to decide which of 1. and 2. it thinks will be the fastest. But it'll never leverage both indexes at the same time.

  4. Creating a multicolumn index will not help in any way. From my testing Postgres will not change its query plan at all whether it's ("textCol", "timestampCol") or ("timestampCol", "textCol").

  5. I've tried the btree_gin and btree_gist extensions, but I was never able to get the query planner to keep the rows "pre-grouped", or to get decent speed gains at the ~4,000,000-row scale compared to 1. and 2. Maybe I wasn't using them correctly? How would I structure those indexes and adapt my query to them? (A sketch of such an index is below.)
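
    For illustration, a multicolumn btree_gin index could be declared along these lines (the index name is invented, and whether the planner would actually prefer it for this query is a separate question):

    CREATE EXTENSION IF NOT EXISTS btree_gin;

    -- GIN index covering both columns; btree_gin supplies the operator
    -- classes that let GIN index plain scalar types like text and timestamp.
    CREATE INDEX test_table_gin_idx ON test_table
      USING gin ("textCol", "timestampCol");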

Please let me know what I might be misunderstanding. How can I optimize such a query for a table containing a few million rows?

Important information about the structure of the data:

  • The timestamps used in BETWEEN are 99% of the time similar to "last 2 weeks" or "last month". In some cases, the BETWEEN will end up selecting up to 99% of the rows, but very rarely will it select 100% of them.

  • The textCol column can be incredibly varied or incredibly regular. In some cases, out of let's say 3 million rows there'll be 2.9 million unique textCol values. In other cases for the same number of rows there'll be only 30,000-100,000 unique textCol values.

I'm using Postgres 9.4, but upgrading to 9.5 is feasible as long as the performance gains can justify it.

Best Answer

Idea 1

Judging by their names, the columns "denormalizedData" and "hugeText" seem to be comparatively big, probably many times as big as the columns involved in your query. Size matters for big queries like this. Very big values (> 2 kB) for text or jsonb get "TOASTed", which can avert the worst. But even the remainder, or smaller values stored inline, can be several times as big as the columns relevant to your query, which add up to around 100 bytes.

Splitting columns relevant to the query into a separate 1:1 table might go a long way. (Depends on the complete situation. You add some storage overhead for another row header and another PK and writing to the tables gets a bit more complicated and expensive.)
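
A rough sketch of such a split, with invented names (only the wide columns move to the side table; everything your query touches stays in the slim main table):

-- Sketch only: move the wide, query-irrelevant columns to a 1:1 side table.
CREATE TABLE test_table_payload (
  id uuid PRIMARY KEY REFERENCES test_table (id),
  "hugeText" text NOT NULL,
  "denormalizedData" jsonb NOT NULL
);

-- After copying the data over, drop the wide columns from the main table:
-- ALTER TABLE test_table DROP COLUMN "hugeText", DROP COLUMN "denormalizedData";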

Idea 2

Also (like you confirmed) only 4 columns are relevant to determine the top 50.

You might have an angle there for a much smaller materialized view (MV) containing just those columns plus "timestampCol" and "textCol", and only the "last 2 weeks" or "last month" of data. Run a fast query on the MV to identify the top 50 "textCol" values and only retrieve those rows from the big table. Or, to be precise, just the additional columns not contained in your MV - you already get sums for the MV columns in the first step.

You only need an index on ("textCol") for the big table. And another one on ("timestampCol") for the MV - which would only be used for instances of your query with a selective WHERE clause. Else, it will be cheaper to sequentially scan the whole MV.
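
A minimal sketch of that layout, assuming the four ORDER BY candidates are "bigint1" through "bigint4" (the MV and index names are invented):

-- Slim MV: only the columns needed to find the top 50, only recent rows.
CREATE MATERIALIZED VIEW mv_recent AS
SELECT "textCol", "timestampCol", "bigint1", "bigint2", "bigint3", "bigint4"
FROM   test_table
WHERE  "timestampCol" >= localtimestamp - interval '1 month';

CREATE INDEX mv_recent_timestamp_idx ON mv_recent ("timestampCol");
CREATE INDEX test_table_textcol_idx  ON test_table ("textCol");

-- Step 1: identify the top 50 "textCol" values from the slim MV.
SELECT "textCol",
       SUM("bigint1") AS s1, SUM("bigint2") AS s2,
       SUM("bigint3") AS s3, SUM("bigint4") AS s4
FROM   mv_recent
WHERE  "timestampCol" BETWEEN '2016-06-12' AND '2016-06-17'
GROUP  BY "textCol"
ORDER  BY s2, s3 DESC
LIMIT  50;

-- Step 2: aggregate the remaining columns from test_table for just those
-- 50 values, using the index on ("textCol").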

If many of your queries cover the same period of time, you might go one step further: only save one row per "textCol" in the MV with pre-aggregated sums (maybe two or more MVs for a couple of frequently used time periods). You get the idea. That should be faster still.
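
For example (again only a sketch, with an invented name and a fixed one-month window assumed):

-- One pre-aggregated row per "textCol" for a frequently used window.
CREATE MATERIALIZED VIEW mv_last_month_sums AS
SELECT "textCol",
       SUM("bigint1") AS sum_bigint1,
       SUM("bigint2") AS sum_bigint2,
       SUM("bigint3") AS sum_bigint3,
       SUM("bigint4") AS sum_bigint4
FROM   test_table
WHERE  "timestampCol" >= localtimestamp - interval '1 month'
GROUP  BY "textCol";

-- The top-50 query against this MV is then a trivial sort over small rows:
SELECT "textCol", sum_bigint1, sum_bigint2, sum_bigint3, sum_bigint4
FROM   mv_last_month_sums
ORDER  BY sum_bigint2, sum_bigint3 DESC
LIMIT  50;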

You might even create the MVs with the whole result set and refresh them before the first new query of the day.
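
On Postgres 9.4 the refresh can even run without blocking readers, provided the MV has a unique index, e.g. on "textCol" for the pre-aggregated variant above:

-- Plain refresh (locks the MV against reads while it runs):
REFRESH MATERIALIZED VIEW mv_last_month_sums;

-- Non-blocking variant; requires a unique index on the MV:
CREATE UNIQUE INDEX ON mv_last_month_sums ("textCol");
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_last_month_sums;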

Depending on exact numbers, you might combine both ideas.