I'm not very experienced with database query optimization. I've been reading through similar questions here and Postgres tuning articles online, but unfortunately I haven't had any luck. Here's what I have (DB version, table information, number of rows, EXPLAIN ANALYZE
of the two queries I would like to optimize):
foodb=# SELECT version();
version
---------------------------------------------------------------------------------------------------------------
PostgreSQL 9.3.4 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52), 64-bit
(1 row)
foodb=# \d+ datapoints
Table "public.datapoints"
Column | Type | Modifiers | Storage | Stats target | Description
--------+-----------------------------+-----------+----------+--------------+-------------
tag | character varying(200) | not null | extended | |
time | timestamp without time zone | not null | plain | |
value | double precision | | plain | |
Indexes:
"ix_datapoints_tag" btree (tag)
"ix_datapoints_tag_time_value" btree (tag, "time", value)
Has OIDs: no
foodb=# SELECT COUNT(*) FROM datapoints;
count
-----------
162727709
(1 row)
foodb=# EXPLAIN ANALYZE SELECT DISTINCT tag FROM datapoints;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=5035440.60..5035691.99 rows=25139 width=106) (actual time=74304.278..74357.708 rows=218367 loops=1)
-> Seq Scan on datapoints (cost=0.00..4641784.68 rows=157462368 width=106) (actual time=0.019..24484.872 rows=162727709 loops=1)
Total runtime: 74371.153 ms
(3 rows)
foodb=# EXPLAIN ANALYZE SELECT tag, min(time) as minTime, max(time) as maxTime FROM datapoints GROUP BY tag;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=5822752.44..5823003.83 rows=25139 width=114) (actual time=86848.009..86908.264 rows=218367 loops=1)
-> Seq Scan on datapoints (cost=0.00..4641784.68 rows=157462368 width=114) (actual time=0.010..19226.031 rows=162727709 loops=1)
Total runtime: 86922.564 ms
(3 rows)
I tried to make the following configuration changes in postgresql.conf, but couldn't see much performance improvement:
max_connections = 128          # was 300
work_mem = 64MB                # was 8MB
shared_buffers = 4096MB        # was 128MB
effective_cache_size = 4096MB  # was 128MB
What do you suggest so that I can improve the performance of these two queries?
SELECT DISTINCT tag FROM datapoints;
SELECT tag, min(time) as minTime, max(time) as maxTime FROM datapoints GROUP BY tag;
Update 1:
Turning off sequential scan doesn't seem to help much.
foodb=# set enable_seqscan = off;
SET
foodb=# EXPLAIN ANALYZE SELECT DISTINCT tag FROM datapoints;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.82..591769146.07 rows=25139 width=106) (actual time=0.037..81086.671 rows=218367 loops=1)
-> Index Only Scan using ix_datapoints_tag on datapoints (cost=0.82..591375490.15 rows=157462368 width=106) (actual time=0.033..54393.956 rows=162727709 loops=1)
Heap Fetches: 162727709
Total runtime: 81104.455 ms
(4 rows)
Update 2:
Tried creating an index on the time column and ran the second query; still no luck.
foodb=# CREATE INDEX ix_datapoints_time ON datapoints(time);
CREATE INDEX
foodb=# EXPLAIN ANALYZE SELECT tag, min(time) as minTime, max(time) as maxTime FROM datapoints GROUP BY tag;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=5914895.96..5915147.35 rows=25139 width=114) (actual time=91870.389..91931.164 rows=218367 loops=1)
-> Seq Scan on datapoints (cost=0.00..4694438.12 rows=162727712 width=114) (actual time=0.013..24101.108 rows=162727709 loops=1)
Total runtime: 91944.348 ms
(3 rows)
Best Answer
I ran a test myself and found that changing the DB structure helps the first query a lot and the second one only a little.
My configuration
My PostgreSQL configuration is default except for the working memory, which is set to yours (set work_mem = '64MB'). I'm working on an SSD laptop with PostgreSQL 9.5.1. And I'm no expert!
New schematization
With this setup we put the tags in a separate table, so it is easy to guarantee that they are unique.
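A minimal sketch of the two-table layout (the table and column names here are my own choice, not taken from your schema):

```sql
-- Tags live in their own table; UNIQUE guarantees each tag appears once.
CREATE TABLE tags (
    pk_tag serial PRIMARY KEY,
    tag    varchar(200) NOT NULL UNIQUE
);

-- Datapoints reference the tag by a small integer foreign key instead of
-- repeating the (up to 200-char) string on every row.
CREATE TABLE datapoints (
    fk_tag integer NOT NULL REFERENCES tags (pk_tag),
    time   timestamp without time zone NOT NULL,
    value  double precision
);
```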
Populate tables
I filled the database with random data, but with 10x fewer rows than in your case, since otherwise the inserts would have taken more than an hour. Any result I obtain from benchmarking will therefore be multiplied by 10 to estimate a table of your size.
Notice! I do not know how many tags you have; the number used here is just hypothetical.
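The fill could look something like this (a sketch; the 20 000-tag count is my hypothetical choice, and the table names assume the two-table layout above):

```sql
-- ~20 000 hypothetical distinct tags.
INSERT INTO tags (tag)
SELECT 'tag_' || g FROM generate_series(1, 20000) AS g;

-- ~16 M datapoints, a tenth of your 162 M rows, with random tags,
-- timestamps within the last year, and random values.
INSERT INTO datapoints (fk_tag, time, value)
SELECT (random() * 19999)::int + 1,
       now() - (random() * interval '365 days'),
       random()
FROM generate_series(1, 16000000);
```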
A little bit of cleanup before we start, and we let the query planner gather some statistics about the tables.
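Concretely, the cleanup step is just VACUUM ANALYZE, which reclaims dead rows and refreshes the planner's statistics (table names assume the schema sketched above):

```sql
-- Reclaim dead tuples and update planner statistics for both tables.
VACUUM ANALYZE tags;
VACUUM ANALYZE datapoints;
```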
Comparing with your hardware
First of all I'll create a table like yours (still 10x smaller) to estimate the difference between my and your hardware speed.
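The comparison table could be built like this (a sketch; datapoints_joined is the name used in the timings below, and the hypothetical tags/datapoints tables from the new schematization are assumed to be populated):

```sql
-- One flat table mirroring your original layout, filled by joining the
-- normalized tables back together.
CREATE TABLE datapoints_joined AS
SELECT t.tag, d.time, d.value
FROM datapoints AS d
JOIN tags AS t ON t.pk_tag = d.fk_tag;
```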
By running the two queries as you wrote them on datapoints_joined, the results of 246.15766 s and 232.87162 s (already multiplied by 10) show that my computer is slower by 3.31x and 2.68x respectively, so approximately 3x slower.
Any result I obtain from benchmarking will therefore also be divided by 3 to estimate hardware of your speed. So basically each estimate below is (my time) × 10 / 3.
[I do really hope that all my estimates are ok...]
First query
Since the tag column is already UNIQUE, there is no need for a DISTINCT.
Result
On your machine (multiplying the execution time by 10/3) it should take 143.33 ms = 0.14333 s, which is ~519x faster than 74371.153 ms. Awesome!
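Concretely, the rewritten first query is as simple as this (assuming the separate, hypothetical tags table from the new schematization):

```sql
-- Every tag appears exactly once in tags, so a plain SELECT replaces
-- SELECT DISTINCT tag FROM datapoints;
SELECT tag FROM tags;
```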
Why it is faster
In your case you were performing a DISTINCT query on the tag column, which is varchar/text. This requires comparing the whole string each time. In my case the column is declared UNIQUE and is placed in a separate table. This allows a simple select of all the tags, which is obviously faster. The insertion of new tags, on the other hand, takes longer because of the UNIQUE check.
Second query
Now we perform a JOIN to get the actual tag text instead of the foreign key, but we do it after the GROUP BY on the integer field fk_tag, so both the grouping and the join operate on small integers.
Result
On your machine (multiplying the execution time by 10/3) it should take 60018.00 ms = 60.018 s, which is ~1.45x faster than 86922.564 ms. Not a huge improvement, but still better than nothing.
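The rewritten second query could look like this (a sketch assuming the hypothetical tags(pk_tag, tag) table from the new schematization; fk_tag is the integer foreign key on datapoints):

```sql
-- Group on the small integer fk_tag first, then join only the already
-- aggregated ~218 k rows back to tags to recover the tag text.
SELECT t.tag, g.minTime, g.maxTime
FROM (
    SELECT fk_tag,
           min(time) AS minTime,
           max(time) AS maxTime
    FROM datapoints
    GROUP BY fk_tag
) AS g
JOIN tags AS t ON t.pk_tag = g.fk_tag;
```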
A few considerations
Since we are selecting most of the table, PostgreSQL chooses the sequential read over the index, and it is right to do so. This is the reason I used no index at all.
If you need the first query more than the second one, maybe this can be a solution for you. Please try it on your machine to be sure.