Indexes for SQL Query with WHERE Condition and GROUP BY

indexoptimizationperformancepostgresqlpostgresql-9.3query-performance

I am trying to determine which indexes to use for an SQL query with a WHERE condition and a GROUP BY which is currently running very slow.

My query:

SELECT group_id
FROM counter
WHERE ts between timestamp '2014-03-02 00:00:00.0' and timestamp '2014-03-05 12:00:00.0'
GROUP BY group_id

The table currently has 32.000.000 rows. The execution time of the query increases a lot when I increase the time-frame.

The table in question looks like this:

CREATE TABLE counter (
    id bigserial PRIMARY KEY
  , ts timestamp NOT NULL
  , group_id bigint NOT NULL
);

I currently have the following indexes, but the performance is still slow:

CREATE INDEX ts_index
  ON counter
  USING btree
  (ts);

CREATE INDEX group_id_index
  ON counter
  USING btree
  (group_id);

CREATE INDEX comp_1_index
  ON counter
  USING btree
  (ts, group_id);

CREATE INDEX comp_2_index
  ON counter
  USING btree
  (group_id, ts);

Running EXPLAIN on the query gives the following result:

"QUERY PLAN"
"HashAggregate  (cost=467958.16..467958.17 rows=1 width=4)"
"  ->  Index Scan using ts_index on counter  (cost=0.56..467470.93 rows=194892 width=4)"
"        Index Cond: ((ts >= '2014-02-26 00:00:00'::timestamp without time zone) AND (ts <= '2014-02-27 23:59:00'::timestamp without time zone))"

SQL Fiddle with example data: http://sqlfiddle.com/#!15/7492b/1

The Question

Can the performance of this query be improved by adding better indexes, or must I increase the processing power?

Edit 1

PostgreSQL version 9.3.2 is used.

Edit 2

I tried @Erwin 's proposal with EXISTS:

SELECT group_id
FROM   groups g
WHERE  EXISTS (
   SELECT 1
   FROM   counter c
   WHERE  c.group_id = g.group_id
   AND    ts BETWEEN timestamp '2014-03-02 00:00:00'
                 AND timestamp '2014-03-05 12:00:00'
   );

But unfortunetly this didn't seem to increase the performance. The Query Plan:

"QUERY PLAN"
"Nested Loop Semi Join  (cost=1607.18..371680.60 rows=113 width=4)"
"  ->  Seq Scan on groups g  (cost=0.00..2.33 rows=133 width=4)"
"  ->  Bitmap Heap Scan on counter c  (cost=1607.18..158895.53 rows=60641 width=4)"
"        Recheck Cond: ((group_id = g.id) AND (ts >= '2014-01-01 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"
"        ->  Bitmap Index Scan on comp_2_index  (cost=0.00..1592.02 rows=60641 width=0)"
"              Index Cond: ((group_id = g.id) AND (ts >= '2014-01-01 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"

Edit 3

The query plan for the LATERAL query from ypercube:

"QUERY PLAN"
"Nested Loop  (cost=8.98..1200.42 rows=133 width=20)"
"  ->  Seq Scan on groups g  (cost=0.00..2.33 rows=133 width=4)"
"  ->  Result  (cost=8.98..8.99 rows=1 width=0)"
"        One-Time Filter: ($1 IS NOT NULL)"
"        InitPlan 1 (returns $1)"
"          ->  Limit  (cost=0.56..4.49 rows=1 width=8)"
"                ->  Index Only Scan using comp_2_index on counter c  (cost=0.56..1098691.21 rows=279808 width=8)"
"                      Index Cond: ((group_id = $0) AND (ts IS NOT NULL) AND (ts >= '2010-03-02 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"
"        InitPlan 2 (returns $2)"
"          ->  Limit  (cost=0.56..4.49 rows=1 width=8)"
"                ->  Index Only Scan Backward using comp_2_index on counter c_1  (cost=0.56..1098691.21 rows=279808 width=8)"
"                      Index Cond: ((group_id = $0) AND (ts IS NOT NULL) AND (ts >= '2010-03-02 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"

Best Answer

Another idea, that also uses the groups table and a construction called LATERAL join (for SQL-Server fans, this is almost identical to OUTER APPLY). It has the advantage that aggregates can be calculated in the subquery:

SELECT group_id, min_ts, max_ts
FROM   groups g,                    -- notice the comma here, is required
  LATERAL 
       ( SELECT MIN(ts) AS min_ts,
                MAX(ts) AS max_ts
         FROM counter c
         WHERE c.group_id = g.group_id
           AND c.ts BETWEEN timestamp '2011-03-02 00:00:00'
                        AND timestamp '2013-03-05 12:00:00'
       ) x 
WHERE min_ts IS NOT NULL ;

Test at SQL-Fiddle shows that the query does index scans on the (group_id, ts) index.

Similar plans are produced using 2 lateral joins, one for min and one for max and also with 2 inline correlated subqueries. They could also be used if you need to show the whole counter rows besides the min and max dates:

SELECT group_id, 
       min_ts, min_ts_id, 
       max_ts, max_ts_id 
FROM   groups g
  , LATERAL 
       ( SELECT ts AS min_ts, c.id AS min_ts_id
         FROM counter c
         WHERE c.group_id = g.group_id
           AND c.ts BETWEEN timestamp '2012-03-02 00:00:00'
                        AND timestamp '2014-03-05 12:00:00'
         ORDER BY ts ASC
         LIMIT 1
       ) xmin
  , LATERAL 
       ( SELECT ts AS max_ts, c.id AS max_ts_id
         FROM counter c
         WHERE c.group_id = g.group_id
           AND c.ts BETWEEN timestamp '2012-03-02 00:00:00'
                        AND timestamp '2014-03-05 12:00:00'
         ORDER BY ts DESC 
         LIMIT 1
       ) xmax
WHERE min_ts IS NOT NULL ;

The Question

Edit 1

Edit 2

Edit 3

Best Answer

Related Solutions

PostgreSQL Indexing – How to Index WHERE (start_date >= ‘2013-12-15’)

Postgresql – How to speed up a Postgres query containing lots of Joins with an ILIKE condition

Related Question