I am running PostgreSQL 9.2 and have a 12-column relation with about 6,700,000 rows. It contains nodes in a 3D space, each one referencing the user who created it. To query how many nodes each user has created, I do the following (EXPLAIN ANALYZE added for more information):
EXPLAIN ANALYZE SELECT user_id, count(user_id) FROM treenode WHERE project_id=1 GROUP BY user_id;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=253668.70..253669.07 rows=37 width=8) (actual time=1747.620..1747.623 rows=38 loops=1)
-> Seq Scan on treenode (cost=0.00..220278.79 rows=6677983 width=8) (actual time=0.019..886.803 rows=6677983 loops=1)
Filter: (project_id = 1)
Total runtime: 1747.653 ms
As you can see, this takes about 1.7 seconds. That isn't too bad considering the amount of data, but I wonder if it can be improved. I tried to add a B-tree index on the user column, but that didn't help in any way.
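For reference, the index I tried looked along these lines (a sketch; the name matches treenode_user_index in the table definition below):

```sql
-- B-tree index on the user column alone; it did not help this query,
-- because the filter is on project_id, not user_id
CREATE INDEX treenode_user_index ON treenode (user_id);
```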
Do you have alternative suggestions?
For the sake of completeness, this is the complete table definition with all its indexes (without foreign key constraints, references and triggers):
Column | Type | Modifiers
---------------+--------------------------+------------------------------------------------------
id | bigint | not null default nextval('concept_id_seq'::regclass)
user_id | bigint | not null
creation_time | timestamp with time zone | not null default now()
edition_time | timestamp with time zone | not null default now()
project_id | bigint | not null
location | double3d | not null
reviewer_id | integer | not null default (-1)
review_time | timestamp with time zone |
editor_id | integer |
parent_id | bigint |
radius | double precision | not null default 0
confidence | integer | not null default 5
skeleton_id | bigint |
Indexes:
"treenode_pkey" PRIMARY KEY, btree (id)
"treenode_id_key" UNIQUE CONSTRAINT, btree (id)
"skeleton_id_treenode_index" btree (skeleton_id)
"treenode_editor_index" btree (editor_id)
"treenode_location_x_index" btree (((location).x))
"treenode_location_y_index" btree (((location).y))
"treenode_location_z_index" btree (((location).z))
"treenode_parent_id" btree (parent_id)
"treenode_user_index" btree (user_id)
Edit: This is the result when I use the query (and index) proposed by @ypercube (the query takes about 5.3 seconds without EXPLAIN ANALYZE):
EXPLAIN ANALYZE SELECT u.id, ( SELECT COUNT(*) FROM treenode AS t WHERE t.project_id=1 AND t.user_id = u.id ) AS number_of_nodes FROM auth_user As u;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on auth_user u (cost=0.00..6987937.85 rows=46 width=4) (actual time=29.934..5556.147 rows=46 loops=1)
SubPlan 1
-> Aggregate (cost=151911.65..151911.66 rows=1 width=0) (actual time=120.780..120.780 rows=1 loops=46)
-> Bitmap Heap Scan on treenode t (cost=4634.41..151460.44 rows=180486 width=0) (actual time=13.785..114.021 rows=145174 loops=46)
Recheck Cond: ((project_id = 1) AND (user_id = u.id))
Rows Removed by Index Recheck: 461076
-> Bitmap Index Scan on treenode_user_index (cost=0.00..4589.29 rows=180486 width=0) (actual time=13.082..13.082 rows=145174 loops=46)
Index Cond: ((project_id = 1) AND (user_id = u.id))
Total runtime: 5556.190 ms
(9 rows)
Time: 5556.804 ms
Edit 2: This is the result when I use an index on (project_id, user_id) (but no schema optimization yet), as @erwin-brandstetter suggested. The query runs in about 1.5 seconds, i.e. at the same speed as my original query:
EXPLAIN ANALYZE SELECT user_id, count(user_id) as ct FROM treenode WHERE project_id=1 GROUP BY user_id;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=253670.88..253671.24 rows=37 width=8) (actual time=1807.334..1807.339 rows=38 loops=1)
-> Seq Scan on treenode (cost=0.00..220280.62 rows=6678050 width=8) (actual time=0.183..893.491 rows=6678050 loops=1)
Filter: (project_id = 1)
Total runtime: 1807.368 ms
(4 rows)
Best Answer
The main problem is the missing index. But there is more.
You have many bigint columns, which is probably overkill. Typically, integer is more than enough for columns like project_id and user_id. This would also help the next item.

While optimizing the table definition, consider this related answer, with an emphasis on data alignment and padding. But most of the rest applies, too.
The elephant in the room: there is no index on project_id. Create one. This is more important than the rest of this answer.

While you are at it, make it a multicolumn index:
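A sketch of such a multicolumn index (the index name is my own choice, not from the original post):

```sql
-- covers WHERE project_id = ? and, per column order, user_id as well
CREATE INDEX treenode_project_id_user_id_index ON treenode (project_id, user_id);
```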
If you followed my advice, integer would be perfect here.

user_id is defined NOT NULL, so count(user_id) is equivalent to count(*), but the latter is a bit shorter and faster. (In this specific query, this would even apply without user_id being defined NOT NULL.)

id is already the primary key, so the additional UNIQUE constraint is useless ballast. Drop it.

Aside: I'd not use id as a column name. Use something descriptive like treenode_id.
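Dropping the redundant constraint is a one-liner, using the constraint name from the table definition above:

```sql
-- "treenode_pkey" already enforces uniqueness of id
ALTER TABLE treenode DROP CONSTRAINT treenode_id_key;
```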
Added information

Q: How many different project_id and user_id?
A: Not more than five different project_id.

That means Postgres has to read about 20% of the whole table to satisfy your query. Unless it can use an index-only scan, a sequential scan of the table will be faster than involving any indexes. No more performance to gain here, except by optimizing the table and server settings.
As for the index-only scan: to see how effective that can be, run VACUUM ANALYZE on the table (or VACUUM FULL ANALYZE, if you can afford the exclusive lock it takes). Then try your query again. It should now be moderately faster using only the index. Read this related answer first, as well as the manual page added with Postgres 9.6 and the Postgres Wiki on index-only scans.
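Concretely, the sequence might look like this (a sketch; whether the planner actually chooses an index-only scan depends on the visibility map and table statistics):

```sql
VACUUM ANALYZE treenode;  -- updates the visibility map and statistics

EXPLAIN ANALYZE
SELECT user_id, count(*) AS ct
FROM   treenode
WHERE  project_id = 1
GROUP  BY user_id;
```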