PostgreSQL – Improve Performance of COUNT/GROUP-BY in Large Table

Tags: count, group-by, index, performance, postgresql

I am running PostgreSQL 9.2 and have a table with 13 columns and about 6,700,000 rows. It contains nodes in a 3D space, each referencing the user who created it. To find out how many nodes each user has created, I run the following query (with EXPLAIN ANALYZE for more information):

EXPLAIN ANALYZE SELECT user_id, count(user_id) FROM treenode WHERE project_id=1 GROUP BY user_id;
                                                    QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=253668.70..253669.07 rows=37 width=8) (actual time=1747.620..1747.623 rows=38 loops=1)
   ->  Seq Scan on treenode  (cost=0.00..220278.79 rows=6677983 width=8) (actual time=0.019..886.803 rows=6677983 loops=1)
         Filter: (project_id = 1)
 Total runtime: 1747.653 ms

As you can see, this takes about 1.7 seconds. That isn't too bad considering the amount of data, but I wonder whether it can be improved. I tried adding a B-tree index on the user column, but that didn't help at all.

Do you have alternative suggestions?


For the sake of completeness, this is the complete table definition with all its indexes (without foreign key constraints, references and triggers):

    Column     |           Type           |                      Modifiers                    
---------------+--------------------------+------------------------------------------------------
 id            | bigint                   | not null default nextval('concept_id_seq'::regclass)
 user_id       | bigint                   | not null
 creation_time | timestamp with time zone | not null default now()
 edition_time  | timestamp with time zone | not null default now()
 project_id    | bigint                   | not null
 location      | double3d                 | not null
 reviewer_id   | integer                  | not null default (-1)
 review_time   | timestamp with time zone |
 editor_id     | integer                  |
 parent_id     | bigint                   |
 radius        | double precision         | not null default 0
 confidence    | integer                  | not null default 5
 skeleton_id   | bigint                   |
Indexes:
    "treenode_pkey" PRIMARY KEY, btree (id)
    "treenode_id_key" UNIQUE CONSTRAINT, btree (id)
    "skeleton_id_treenode_index" btree (skeleton_id)
    "treenode_editor_index" btree (editor_id)
    "treenode_location_x_index" btree (((location).x))
    "treenode_location_y_index" btree (((location).y))
    "treenode_location_z_index" btree (((location).z))
    "treenode_parent_id" btree (parent_id)
    "treenode_user_index" btree (user_id)

Edit: This is the result when I use the query (and index) proposed by @ypercube (the query takes about 5.3 seconds without EXPLAIN ANALYZE):

EXPLAIN ANALYZE SELECT u.id, ( SELECT COUNT(*) FROM treenode AS t WHERE t.project_id=1 AND t.user_id = u.id ) AS number_of_nodes FROM auth_user As u;
                                                                        QUERY PLAN                                                                     
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on auth_user u  (cost=0.00..6987937.85 rows=46 width=4) (actual time=29.934..5556.147 rows=46 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=151911.65..151911.66 rows=1 width=0) (actual time=120.780..120.780 rows=1 loops=46)
           ->  Bitmap Heap Scan on treenode t  (cost=4634.41..151460.44 rows=180486 width=0) (actual time=13.785..114.021 rows=145174 loops=46)
                 Recheck Cond: ((project_id = 1) AND (user_id = u.id))
                 Rows Removed by Index Recheck: 461076
                 ->  Bitmap Index Scan on treenode_user_index  (cost=0.00..4589.29 rows=180486 width=0) (actual time=13.082..13.082 rows=145174 loops=46)
                       Index Cond: ((project_id = 1) AND (user_id = u.id))
 Total runtime: 5556.190 ms
(9 rows)

Time: 5556.804 ms

Edit 2: This is the result when I use an index on (project_id, user_id), as suggested by @erwin-brandstetter, but no schema optimization yet (the query runs in about 1.5 seconds, i.e. at the same speed as my original query):

EXPLAIN ANALYZE SELECT user_id, count(user_id) as ct FROM treenode WHERE project_id=1 GROUP BY user_id;
                                                        QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=253670.88..253671.24 rows=37 width=8) (actual time=1807.334..1807.339 rows=38 loops=1)
   ->  Seq Scan on treenode  (cost=0.00..220280.62 rows=6678050 width=8) (actual time=0.183..893.491 rows=6678050 loops=1)
         Filter: (project_id = 1)
 Total runtime: 1807.368 ms
(4 rows)

Best Answer

The main problem is the missing index. But there is more. First, simplify the query:

SELECT user_id, count(*) AS ct
FROM   treenode
WHERE  project_id = 1
GROUP  BY user_id;
  • You have many bigint columns. That's probably overkill: typically, integer is more than enough for columns like project_id and user_id. It would also help the next item.
    While optimizing the table definition, consider the related answer on data alignment and padding; most of the rest of it applies here, too.

  • The elephant in the room: there is no index on project_id. Create one. This is more important than the rest of this answer.
    While you're at it, make it a multicolumn index:

    CREATE INDEX treenode_project_id_user_id_index ON treenode (project_id, user_id);
    

    If you follow the advice above and switch those columns to integer, this index gets smaller, too.
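    For the type change, a sketch of the migration could look like this (a sketch only; verify dependent foreign keys and views first, and note that this rewrites the whole table under an exclusive lock, which takes a while at this size):

    ALTER TABLE treenode
        ALTER COLUMN project_id TYPE integer,
        ALTER COLUMN user_id    TYPE integer;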

  • user_id is defined NOT NULL, so count(user_id) is equivalent to count(*), but the latter is a bit shorter and faster. (In this specific query, this would even apply without user_id being defined NOT NULL.)

  • id is already the primary key, so the additional UNIQUE constraint is useless ballast. Drop it:

    "treenode_pkey" PRIMARY KEY, btree (id)
    "treenode_id_key" UNIQUE CONSTRAINT, btree (id)

    Aside: I wouldn't use id as a column name. Use something descriptive like treenode_id.
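    Dropping the redundant constraint is a one-liner (constraint name taken from the index list above):

    ALTER TABLE treenode DROP CONSTRAINT treenode_id_key;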

Added information

Q: How many different project_id and user_id?
A: Not more than five different project_id.

That means Postgres has to read about 20% of the whole table to satisfy your query. Unless it can use an index-only scan, a sequential scan on the table will be faster than involving any indexes. No more performance to gain here - except by optimizing the table and server settings.

As for the index-only scan: to see how effective it can be, run VACUUM ANALYZE on the table if you can afford it (plain VACUUM, unlike VACUUM FULL, does not lock the table exclusively). Then try your query again; it should now be moderately faster, using only the index. Read this related answer on index-only scans first.

Also see the manual page added with Postgres 9.6 and the Postgres Wiki on index-only scans.
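Putting this together, a quick test could look like the following sketch (the index name assumes the multicolumn index suggested above was created; the plan node shown is what you would hope to see once the visibility map is current):

    VACUUM ANALYZE treenode;

    EXPLAIN ANALYZE
    SELECT user_id, count(*) AS ct
    FROM   treenode
    WHERE  project_id = 1
    GROUP  BY user_id;

    -- hope for a plan node like:
    -- Index Only Scan using treenode_project_id_user_id_index on treenode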