PostgreSQL – Optimizing SELECT queries

optimization, postgresql, query-performance

I'm not very experienced with database query optimization. I've been reading through similar questions here and Postgres tuning articles online, but unfortunately I haven't had any luck. Here's what I have (DB version, table information, number of rows, EXPLAIN ANALYZE of the two queries I would like to optimize):

foodb=# SELECT version();
                                                    version
---------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.3.4 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52), 64-bit
(1 row)

foodb=# \d+ datapoints
                                Table "public.datapoints"
 Column |            Type             | Modifiers | Storage  | Stats target | Description
--------+-----------------------------+-----------+----------+--------------+-------------
 tag    | character varying(200)      | not null  | extended |              |
 time   | timestamp without time zone | not null  | plain    |              |
 value  | double precision            |           | plain    |              |
Indexes:
    "ix_datapoints_tag" btree (tag)
    "ix_datapoints_tag_time_value" btree (tag, "time", value)
Has OIDs: no

foodb=# SELECT COUNT(*) FROM datapoints;
   count
-----------
 162727709
(1 row)

foodb=# EXPLAIN ANALYZE SELECT DISTINCT tag FROM datapoints;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=5035440.60..5035691.99 rows=25139 width=106) (actual time=74304.278..74357.708 rows=218367 loops=1)
   ->  Seq Scan on datapoints  (cost=0.00..4641784.68 rows=157462368 width=106) (actual time=0.019..24484.872 rows=162727709 loops=1)
 Total runtime: 74371.153 ms
(3 rows)

foodb=# EXPLAIN ANALYZE SELECT tag, min(time) as minTime, max(time) as maxTime FROM datapoints GROUP BY tag;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=5822752.44..5823003.83 rows=25139 width=114) (actual time=86848.009..86908.264 rows=218367 loops=1)
   ->  Seq Scan on datapoints  (cost=0.00..4641784.68 rows=157462368 width=114) (actual time=0.010..19226.031 rows=162727709 loops=1)
 Total runtime: 86922.564 ms
(3 rows)

I tried to make the following configuration changes in postgresql.conf, but couldn't see much performance improvement:

max_connections = 128          # was 300
work_mem = 64MB                # was 8MB
shared_buffers = 4096MB        # was 128MB
effective_cache_size = 4096MB  # was 128MB
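
For reference, SHOW confirms what the running server actually uses; note that shared_buffers and max_connections only take effect after a restart, while the other two can be applied with a reload:

SHOW work_mem;
SHOW shared_buffers;
SELECT pg_reload_conf();  -- re-reads postgresql.conf without a restart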

What do you suggest so that I can improve the performance of these two queries?

SELECT DISTINCT tag FROM datapoints;
SELECT tag, min(time) as minTime, max(time) as maxTime FROM datapoints GROUP BY tag;

Update 1:

Turning off sequential scan doesn't seem to help much.

foodb=# set enable_seqscan = off;
SET
foodb=# EXPLAIN ANALYZE SELECT DISTINCT tag FROM datapoints;
                                                                              QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.82..591769146.07 rows=25139 width=106) (actual time=0.037..81086.671 rows=218367 loops=1)
   ->  Index Only Scan using ix_datapoints_tag on datapoints  (cost=0.82..591375490.15 rows=157462368 width=106) (actual time=0.033..54393.956 rows=162727709 loops=1)
         Heap Fetches: 162727709
 Total runtime: 81104.455 ms
(4 rows)
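
One likely reason, judging by the Heap Fetches: 162727709 line: the visibility map is stale, so the "index only" scan still visits the heap for every single row. As a hedged sketch (untested here), a VACUUM refreshes the visibility map and might allow a true index-only scan:

VACUUM datapoints;  -- sets all-visible bits in the visibility map, which index-only scans rely on
EXPLAIN ANALYZE SELECT DISTINCT tag FROM datapoints;  -- Heap Fetches should drop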

Update 2:

I tried creating an index on the time column and ran the second query again; still no luck.

foodb=# CREATE INDEX ix_datapoints_time ON datapoints(time);
CREATE INDEX
foodb=# EXPLAIN ANALYZE SELECT tag, min(time) as minTime, max(time) as maxTime FROM datapoints GROUP BY tag;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=5914895.96..5915147.35 rows=25139 width=114) (actual time=91870.389..91931.164 rows=218367 loops=1)
   ->  Seq Scan on datapoints  (cost=0.00..4694438.12 rows=162727712 width=114) (actual time=0.013..24101.108 rows=162727709 loops=1)
 Total runtime: 91944.348 ms
(3 rows)

Best Answer

I ran a test myself and found that changing the DB structure helps the performance of the first query a lot, and the second one just a bit.

My configuration

My PostgreSQL configuration is the default except for the working memory, which is set to match yours (SET work_mem = '64MB'). I'm working on a laptop with an SSD, running PostgreSQL 9.5.1. And I'm no expert!

New schema

With this setup we put the tags in a separate table, which makes it easy to enforce that they are unique.

CREATE TABLE IF NOT EXISTS tags (
    id  serial NOT NULL
  , tag text      NOT NULL UNIQUE

  , PRIMARY KEY (id)
    );

CREATE TABLE IF NOT EXISTS datapoints (
    fk_tag integer                     NOT NULL
  , time   timestamp without time zone NOT NULL
  , value  double precision

  , FOREIGN KEY (fk_tag)
        REFERENCES tags (id)
        ON UPDATE CASCADE  -- customize as needed
        ON DELETE RESTRICT -- customize as needed
    );

Populate tables

I filled the database with random data, but with 10x fewer rows than in your case; otherwise the inserts would have taken me more than an hour to complete. Any result I obtain from benchmarking will be multiplied by 10 to estimate a table of your size.

Notice! I do not know how many tags you have; the number below is just hypothetical.

INSERT INTO tags (tag)
    SELECT md5(random()::text) || md5(random()::text)
        FROM generate_series(1, 100000); -- Hypothetical number of tags!

INSERT INTO datapoints (fk_tag, time, value)
    SELECT trunc(random() * (SELECT max(id) - min(id) FROM tags) + (SELECT min(id) FROM tags)) -- 99999 + 628940
        ,  current_timestamp
        ,  random()
        FROM generate_series(1,16272771);
        -- 10x less rows than you have.
        -- This INSERT took ~ 7 minutes to finish.

A little cleanup before we start, and we let the query planner gather some statistics about the tables.

VACUUM;
ANALYZE tags;
ANALYZE datapoints;

Comparing with your hardware

First of all, I'll create a table like yours (still 10x smaller) to estimate the speed difference between my hardware and yours.

SELECT t.tag
    ,  d.time
    ,  d.value
    INTO datapoints_joined
    FROM datapoints AS d
    INNER JOIN tags AS t
    ON d.fk_tag = t.id;
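
These are your two statements verbatim, just pointed at the joined copy:

EXPLAIN ANALYZE SELECT DISTINCT tag FROM datapoints_joined;

EXPLAIN ANALYZE SELECT tag, min(time) AS minTime, max(time) AS maxTime
    FROM datapoints_joined
    GROUP BY tag;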

Running those two queries gives 246.15766 s and 232.87162 s (already multiplied by 10), which shows that my computer is slower by factors of about 3.31 and 2.68 respectively, so approximately 3x slower.

Any result I obtain from benchmarking will also be divided by 3 to estimate hardware of your speed. So basically it should be:

my execution time * 10/3 = your execution time

[I do really hope that all my estimates are ok...]

First query

SELECT tag FROM tags;

Since the tag column is already UNIQUE, there is no need for a DISTINCT.

Result

EXPLAIN ANALYZE SELECT tag FROM tags;

Seq Scan on tags  (cost=0.00..2235.00 rows=100000 width=65) (actual time=0.011..34.362 rows=100000 loops=1)
Planning time: 0.059 ms
Execution time: 42.998 ms -- SUCH SPEED, MUCH WOW!

On your machine (multiplying the execution time by 10/3) it should take 143.33 ms = 0.14333 s, which is ~519x faster than 74371.153 ms. Awesome!

Why it's faster

In your case you were performing a DISTINCT query on the tag column, which is varchar/text. This requires comparing the whole string each time.

In my case the column is declared UNIQUE and is placed in a separate table. This allows a simple select of all the tags, which is obviously faster. The insertion of new tags, on the other hand, costs more because of the UNIQUE check.
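
If you want to try this schema, here is a minimal migration sketch; it assumes your existing table has been renamed to datapoints_old (a hypothetical name):

-- Hypothetical migration; assumes the original table was renamed to datapoints_old.
INSERT INTO tags (tag)
    SELECT DISTINCT tag FROM datapoints_old;

INSERT INTO datapoints (fk_tag, time, value)
    SELECT t.id
        ,  d.time
        ,  d.value
        FROM datapoints_old AS d
        INNER JOIN tags AS t
        ON t.tag = d.tag;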

Second query

Now we perform a JOIN to get the actual tag text instead of the foreign key, but we do it after the GROUP BY on the integer column fk_tag, so the aggregation compares integers and the join only touches the aggregated rows.

SELECT t.tag
    ,  d.minTime
    ,  d.maxTime
    FROM (
        SELECT fk_tag
            , min(time) AS minTime
            , max(time) AS maxTime
            FROM datapoints
            GROUP BY fk_tag
        ) AS d
    INNER JOIN tags AS t
    ON d.fk_tag = t.id;

Hash Join  (cost=391905.20..395209.83 rows=97915 width=81) (actual time=17849.685..17995.566 rows=99999 loops=1)
 Hash Cond: (datapoints.fk_tag = t.id)
 ->  HashAggregate  (cost=388420.20..389399.35 rows=97915 width=12) (actual time=17741.489..17807.022 rows=99999 loops=1)
       Group Key: datapoints.fk_tag
       ->  Seq Scan on datapoints  (cost=0.00..266375.40 rows=16272640 width=12) (actual time=0.006..2720.880 rows=16272771 loops=1)
 ->  Hash  (cost=2235.00..2235.00 rows=100000 width=69) (actual time=108.116..108.116 rows=100000 loops=1)
       Buckets: 131072  Batches: 1  Memory Usage: 11181kB
       ->  Seq Scan on tags t  (cost=0.00..2235.00 rows=100000 width=69) (actual time=0.011..46.569 rows=100000 loops=1)
Planning time: 0.241 ms
Execution time: 18005.401 ms -- MEH, NOT SO FAST

On your machine (multiplying the execution time by 10/3) it should take 60018.00 ms = 60.018 s, which is ~1.45x faster than 86922.564 ms. Not a huge improvement, but still better than nothing.

A few considerations

Since we are selecting most of the table, PostgreSQL chooses a sequential read over the index, and it is right to do so. This is the reason I used no extra index at all.
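
For contrast, an index only starts to pay off for selective lookups. A hypothetical per-tag query like the one below could use an index on (fk_tag, time) instead of scanning everything:

-- Hypothetical: worth it only if you also run selective per-tag lookups.
CREATE INDEX ix_datapoints_fk_tag_time ON datapoints (fk_tag, time);

SELECT min(time) AS minTime
    ,  max(time) AS maxTime
    FROM datapoints
    WHERE fk_tag = 42;  -- 42 is a made-up tag id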

If you need the first query more than the second one, maybe this can be a solution for you. Please try it on your machine to be sure.