PostgreSQL – Efficient Group By to Count Items in Group

countgroup byperformancepostgresqlquery-performance

My query is:

SELECT COUNT("EventType"."id") AS "eventCount", "EventType"."id" AS "EventType.id"
  FROM "events" AS "Event" 
  INNER JOIN "event_types" AS "EventType" ON "Event"."eventTypeId" = "EventType"."id"
  INNER JOIN "projects" AS "EventType->Project" ON "EventType"."projectId" = "EventType->Project"."id"
  WHERE "EventType->Project"."id" = 142
  GROUP BY "EventType"."id";

Basically I want to know, for a given project, how many of each type of event have occurred.

The relevant schema is:

                                                           Table "public.projects"
      Column       |           Type           |                             Modifiers
-------------------+--------------------------+---------------------------------------------------------
 id                | integer                  | not null default nextval('projects_id_seq'::regclass) 
Indexes:
    "projects_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "event_types" CONSTRAINT "event_types_projectId_fkey" FOREIGN KEY ("projectId") REFERENCES projects(id) ON UPDATE CASCADE ON DELETE CASCADE

                                                         Table "public.event_types"
    Column     |           Type           |                                 Modifiers
---------------+--------------------------+------------------------------------------------------------
 id            | integer                  | not null default nextval('event_types_id_seq'::regclass)
 projectId     | integer                  | not null
Indexes:
    "event_types_pkey" PRIMARY KEY, btree (id)
    "event_types_project_id" btree ("projectId")
Foreign-key constraints:
    "event_types_projectId_fkey" FOREIGN KEY ("projectId") REFERENCES projects(id) ON UPDATE CASCADE ON DELETE CASCADE
Referenced by:
    TABLE "events" CONSTRAINT "events_eventTypeId_fkey" FOREIGN KEY ("eventTypeId") REFERENCES event_types(id) ON UPDATE CASCADE ON DELETE CASCADE

                                              Table "public.events"
 Column      |  Type   |                               Modifiers
-------------+---------+-------------------------------------------------------
 id          | integer | not null default nextval('events_id_seq'::regclass)
 eventTypeId | integer | not null
Indexes:
    "events_pkey" PRIMARY KEY, btree (id)
    "events_event_type_id" btree ("eventTypeId")
Foreign-key constraints:
    "events_eventTypeId_fkey" FOREIGN KEY ("eventTypeId") REFERENCES event_types(id) ON UPDATE CASCADE ON DELETE CASCADE

When I run it through EXPLAIN ANALYZE the results are:

    QUERY PLAN                                                                                                       
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=320957.49..320962.39 rows=490 width=12) (actual time=2612.748..2612.814 rows=459 loops=1)
   Group Key: "EventType".id
   ->  Hash Join  (cost=122.12..312038.18 rows=1783862 width=4) (actual time=386.978..2501.421 rows=690140 loops=1)
         Hash Cond: ("Event"."eventTypeId" = "EventType".id)
         ->  Seq Scan on events "Event"  (cost=0.00..239469.41 rows=14562141 width=4) (actual time=0.026..1272.817 rows=14558556 loops=1)
         ->  Hash  (cost=116.00..116.00 rows=490 width=4) (actual time=0.323..0.323 rows=459 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 25kB
               ->  Nested Loop  (cost=0.28..116.00 rows=490 width=4) (actual time=0.061..0.263 rows=459 loops=1)
                     ->  Seq Scan on projects "EventType->Project"  (cost=0.00..1.56 rows=1 width=4) (actual time=0.017..0.021 rows=1 loops=1)
                           Filter: (id = 142)
                           Rows Removed by Filter: 47
                     ->  Index Scan using event_types_project_id on event_types "EventType"  (cost=0.28..109.53 rows=490 width=8) (actual time=0.042..0.193 rows=459 loops=1)
                           Index Cond: ("projectId" = 142)
 Planning time: 3.891 ms
 Execution time: 2613.033 ms

It appears that it is scanning the entire events table (which is quite large) and the entire query is taking quite a bit of time. I had thought I could get away with just a scan of the index. My thought was that the index was keeping counts for each of the index keys but maybe that mental model is flawed.

Is there any way to speed up this type of query? If not I can keep track of the counts myself but if it could be simplified by fixing something in the query or schema that seems easier.

Best Answer

SELECT et.id AS event_type_id, count(e."eventTypeId") AS event_count
FROM   event_types et 
LEFT   JOIN events e ON e."eventTypeId" = et.id
WHERE  et."projectId" = 142
GROUP  BY et.id;

Major points

0. Style

Avoid CaMeL-case names if you can, to make your life and ours easier.

At the very least use legal table aliases in the query that do not require double-quoting.

1. Correctness

Your query misses a corner case for its declared purpose:

I want to know, for a given project, how many of each type of event have occurred.

The [INNER] JOIN to events excludes all types with 0 events from the result. Typically, you need a LEFT [OUTER] JOIN, to get the complete list of event types including those with 0 events.

Consequently, count a non-null column from table events instead. The obvious candidate is count(e."eventTypeId"). That column is NULL if (and only if) no event was found for the type, and count() does not count NULL values.

2. Performance

Since referential integrity is enforced with a FK constraint, there is no need involve table projects at all. We have the id, that's all we need.

Consequently, adapt the WHERE clause to WHERE et."projectId" = 142. Shorter, faster.

3. Indexes

Ideally, replace your existing index event_types_project_id on just ("projectId"), with a multi-column index on ("projectId", id). Exactly the same size, multiple benefits. Primarily I am aiming for index-only scans on table event_types. See:

And an index on ("eventTypeId") in table events, which you already have (events_event_type_id).

Also:

I had thought I could get away with just a scan of the index. My thought was that the index was keeping counts for each of the index keys but maybe that mental model is flawed.

No, the index does not keep counts. There are only internal statistics for the most common values and such. But you can get away with just a scan of the index - if preconditions for index-only scans are met.