Postgresql – How to optimize PostgreSQL query with multiple joins and subqueries and slow grouping/sorting

performance | postgresql-9.4 | query-performance

I have this really gnarly query in PostgreSQL 9.4 that I need to optimize:

SELECT date_series.date, 
    sum(case when serial_number not like '1234%' then 1 else 0 end)::int AS v1,
    sum(case when serial_number like '1234%' then 1 else 0 end)::int AS v2,
    count(*)::int AS total
FROM (
    SELECT generate_series::date AS date 
    FROM (
        SELECT generate_series(date '2016-10-25 00:00', date '2017-10-24T23:59:59', interval '1 day')::date
    ) AS date
) AS date_series 
LEFT JOIN (
    SELECT site_id, timestamp as date, serial_number
    FROM (
        SELECT site_id, timestamp::date
        FROM data_uploads 
        WHERE (timestamp > '2016-10-25 00:00') AND (timestamp <= '2017-10-24T23:59:59') 
        GROUP BY site_id, timestamp
    ) AS unique_sites 
    LEFT JOIN (
        SELECT sites.id, serial_number
        FROM sites 
        LEFT JOIN devices ON (devices.id = sites.device_id)
    ) AS site_devices ON (site_devices.id = unique_sites.site_id)
) AS devices_uploading ON (date_series.date = devices_uploading.date) 
GROUP BY date_series.date ORDER BY date_series.date ASC

It looks at the data_uploads table and gets a count of each of the two device types (v1 and v2), plus the total number of devices uploading, per day. It currently takes minutes to run, even though the data_uploads table has only 213,692 rows.

I tried simplifying the query and came up with one that produces the same result and, I believe, reads better, but in a handful of tests it ran a little slower.

SELECT date_series.date, 
    sum(case when devices_uploading.serial_number not like '1234%' then 1 else 0 end)::int AS v1,
    sum(case when devices_uploading.serial_number like '1234%' then 1 else 0 end)::int AS v2,
    count(*)::int AS total
FROM (
    SELECT generate_series(date '2016-10-25 00:00', date '2017-10-24T23:59:59', interval '1 day')::date AS date 
) AS date_series
LEFT JOIN (
    SELECT site_id, timestamp::date as date, serial_number FROM data_uploads 
    LEFT JOIN sites on data_uploads.site_id = sites.id
    LEFT JOIN devices on sites.device_id = devices.id
    WHERE (timestamp > '2016-10-25 00:00') AND (timestamp <= '2017-10-24T23:59:59') 
    GROUP BY site_id, serial_number, timestamp
) AS devices_uploading ON date_series.date = devices_uploading.date
GROUP BY date_series.date
ORDER BY date_series.date ASC

What I can decipher from EXPLAIN ANALYZE is that the grouping, and the associated sorting, on the inner select is the problem:

GROUP BY site_id, timestamp

in the original query and

GROUP BY site_id, serial_number, timestamp

in the new.

What is the best approach for me to take for optimizing this query? I've read a bit about indexes, but it sounds like they aren't a home run, as PostgreSQL sometimes doesn't use them. Also, in my refactored query, I'm not sure an index can help, since the three grouping columns come from multiple tables.
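
For what it's worth, the closest single-table index I can picture is an expression index on the date cast, though I have no idea whether the planner would actually use it for the join (this is just a sketch):

-- Hypothetical: an expression index on the date cast of "timestamp",
-- so grouping/joining on the day could in principle use an index.
CREATE INDEX data_uploads_date_index ON data_uploads (("timestamp"::date));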

Here's a minimal schema:

CREATE TABLE data_uploads (
  id             serial PRIMARY KEY,
  site_id        text,
  timestamp      timestamp without time zone
);

CREATE TABLE sites (
  id             text PRIMARY KEY,
  device_id      text
);

CREATE TABLE devices (
  id             text PRIMARY KEY,
  serial_number  text
);

CREATE INDEX data_uploads_site_id_timestamp_desc_index ON data_uploads (site_id, timestamp DESC);
CREATE INDEX sites_device_id_index ON sites (device_id);

Query plan for the first query:

Sort  (cost=81900.73..81901.23 rows=200 width=13) (actual time=181457.530..181458.031 rows=365 loops=1)
  Output: ((generate_series(('2016-10-25'::date)::timestamp with time zone, ('2017-10-24'::date)::timestamp with time zone, '1 day'::interval))::date), ((sum(CASE WHEN (devices.serial_number !~~ '1234%'::text) THEN 1 ELSE 0 END))::integer), ((sum(CASE WHEN (devices.serial_number ~~ '1234%'::text) THEN 1 ELSE 0 END))::integer), ((count(*))::integer)
  Sort Key: ((generate_series(('2016-10-25'::date)::timestamp with time zone, ('2017-10-24'::date)::timestamp with time zone, '1 day'::interval))::date)
  Sort Method: quicksort  Memory: 42kB
  Buffers: shared hit=140 read=6545 dirtied=46, temp read=1250 written=1250
  ->  HashAggregate  (cost=81889.58..81893.08 rows=200 width=13) (actual time=181455.988..181456.905 rows=365 loops=1)
        Output: ((generate_series(('2016-10-25'::date)::timestamp with time zone, ('2017-10-24'::date)::timestamp with time zone, '1 day'::interval))::date), (sum(CASE WHEN (devices.serial_number !~~ '1234%'::text) THEN 1 ELSE 0 END))::integer, (sum(CASE WHEN (devices.serial_number ~~ '1234%'::text) THEN 1 ELSE 0 END))::integer, (count(*))::integer
        Group Key: ((generate_series(('2016-10-25'::date)::timestamp with time zone, ('2017-10-24'::date)::timestamp with time zone, '1 day'::interval))::date)
        Buffers: shared hit=137 read=6545 dirtied=46, temp read=1250 written=1250
        ->  Hash Right Join  (cost=31240.73..68684.26 rows=880355 width=13) (actual time=100081.085..173667.705 rows=176318 loops=1)
              Output: ((generate_series(('2016-10-25'::date)::timestamp with time zone, ('2017-10-24'::date)::timestamp with time zone, '1 day'::interval))::date), devices.serial_number
              Hash Cond: (((data_uploads."timestamp")::date) = ((generate_series(('2016-10-25'::date)::timestamp with time zone, ('2017-10-24'::date)::timestamp with time zone, '1 day'::interval))::date))
              Buffers: shared hit=137 read=6545 dirtied=46, temp read=1250 written=1250
              ->  Merge Right Join  (cost=31213.22..37404.14 rows=176071 width=13) (actual time=100060.338..164925.771 rows=176317 loops=1)
                    Output: ((data_uploads."timestamp")::date), devices.serial_number
                    Merge Cond: (sites.id = data_uploads.site_id)
                    Buffers: shared hit=137 read=6545 dirtied=46, temp read=1250 written=1250
                    ->  Sort  (cost=804.31..816.31 rows=4799 width=46) (actual time=1112.675..1312.532 rows=4873 loops=1)
                          Output: sites.id, devices.serial_number
                          Sort Key: sites.id
                          Sort Method: quicksort  Memory: 582kB
                          Buffers: shared hit=37 read=241 dirtied=43
                          ->  Hash Left Join  (cost=210.91..510.88 rows=4799 width=46) (actual time=345.933..869.094 rows=4873 loops=1)
                                Output: sites.id, devices.serial_number
                                Hash Cond: (sites.device_id = devices.id)
                                Buffers: shared hit=37 read=241 dirtied=43
                                ->  Seq Scan on public.sites  (cost=0.00..215.99 rows=4799 width=59) (actual time=0.012..494.866 rows=4873 loops=1)
                                      Output: sites.device_id, sites.id
                                      Buffers: shared hit=32 read=136
                                ->  Hash  (cost=154.85..154.85 rows=4485 width=29) (actual time=345.892..345.892 rows=4485 loops=1)
                                      Output: devices.serial_number, devices.id
                                      Buckets: 1024  Batches: 1  Memory Usage: 267kB
                                      Buffers: shared hit=5 read=105 dirtied=43
                                      ->  Seq Scan on public.devices  (cost=0.00..154.85 rows=4485 width=29) (actual time=0.024..337.914 rows=4485 loops=1)
                                            Output: devices.serial_number, devices.id
                                            Buffers: shared hit=5 read=105 dirtied=43
                    ->  Materialize  (cost=30408.91..34374.95 rows=176071 width=41) (actual time=98947.653..142999.221 rows=176317 loops=1)
                          Output: data_uploads.site_id, ((data_uploads."timestamp")::date), data_uploads."timestamp"
                          Buffers: shared hit=100 read=6304 dirtied=3, temp read=1250 written=1250
                          ->  Group  (cost=30408.91..32174.06 rows=176071 width=45) (actual time=98947.641..134723.593 rows=176317 loops=1)
                                Output: data_uploads.site_id, (data_uploads."timestamp")::date, data_uploads."timestamp"
                                Group Key: data_uploads.site_id, data_uploads."timestamp"
                                Buffers: shared hit=100 read=6304 dirtied=3, temp read=1250 written=1250
                                ->  Sort  (cost=30408.91..30850.57 rows=176663 width=45) (actual time=98947.628..125199.772 rows=176320 loops=1)
                                      Output: data_uploads.site_id, data_uploads."timestamp"
                                      Sort Key: data_uploads.site_id, data_uploads."timestamp"
                                      Sort Method: external merge  Disk: 9976kB
                                      Buffers: shared hit=100 read=6304 dirtied=3, temp read=1250 written=1250
                                      ->  Seq Scan on public.data_uploads  (cost=0.00..9576.67 rows=176663 width=45) (actual time=0.038..35225.200 rows=176320 loops=1)
                                            Output: data_uploads.site_id, data_uploads."timestamp"
                                            Filter: ((data_uploads."timestamp" > '2016-10-25 00:00:00'::timestamp without time zone) AND (data_uploads."timestamp" <= '2017-10-24 23:59:59'::timestamp without time zone))
                                            Rows Removed by Filter: 37412
                                            Buffers: shared hit=97 read=6304 dirtied=3
              ->  Hash  (cost=15.02..15.02 rows=1000 width=4) (actual time=20.731..20.731 rows=365 loops=1)
                    Output: ((generate_series(('2016-10-25'::date)::timestamp with time zone, ('2017-10-24'::date)::timestamp with time zone, '1 day'::interval))::date)
                    Buckets: 1024  Batches: 1  Memory Usage: 13kB
                    ->  Result  (cost=0.00..5.01 rows=1000 width=0) (actual time=0.018..0.697 rows=365 loops=1)
                          Output: (generate_series(('2016-10-25'::date)::timestamp with time zone, ('2017-10-24'::date)::timestamp with time zone, '1 day'::interval))::date
Planning time: 2.256 ms
Execution time: 181539.882 ms

Best Answer

There are several parts of the execution plan that indicate that Postgres simply does not have enough memory to process your query efficiently.

The most important one is this sort:

->  Sort  (cost=30408.91..30850.57 rows=176663 width=45) (actual time=98947.628..125199.772 rows=176320 loops=1)
      Output: data_uploads.site_id, data_uploads."timestamp"
      Sort Key: data_uploads.site_id, data_uploads."timestamp"
      Sort Method: external merge  Disk: 9976kB

This means the sort of 176,320 rows could not be done in memory: Postgres had to write and re-read almost 10MB of data on disk, and that step alone took over two minutes. One conclusion is that your work_mem is way too low for this kind of query. The other is that you seem to have a terribly slow hard disk.
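
For reference, the default work_mem is only 4MB, which is nowhere near enough to sort 176,000 rows in memory; you can check what your server currently uses with:

SHOW work_mem;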

The Materialize step a bit further up the plan also indicates that you have not allocated enough memory to Postgres.

The seq scan on data_uploads:

Seq Scan on public.data_uploads  (cost=0.00..9576.67 rows=176663 width=45) (actual time=0.038..35225.200 rows=176320 loops=1)

shows that the condition

WHERE (timestamp > '2016-10-25 00:00') AND (timestamp <= '2017-10-24T23:59:59') 

doesn't make use of an index. But as that condition only filters out about 20% of the rows, even a suitable index wouldn't help much. Still, 35 seconds for a seq scan of a table with only 176,320 rows is far too long: on my desktop computer (regular hard drive, no SSD), a seq scan of a 1.6 million row table takes only about 4 seconds (when the table is not cached!). So again, this points to a pretty slow hard disk.
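
For completeness, the "suitable index" would simply be a plain btree on the timestamp column, as sketched below, but with roughly 80% of the rows qualifying, the planner would most likely (and rightly) stick with the seq scan:

-- A plain btree index matches the range condition on "timestamp",
-- but it is unlikely to beat a seq scan at this low selectivity.
CREATE INDEX data_uploads_timestamp_index ON data_uploads ("timestamp");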

You should definitely increase work_mem. Given the figures, I'd start with something like:

set work_mem='32MB';

If you have sufficient memory, then maybe also try 64MB. Note that the limit defined here is not a hard overall limit: work_mem is something a single query can request several times (once for each sort, group, or hash step it needs to run).
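
You also don't have to raise it globally; you can scope the change to a single transaction with SET LOCAL, for example:

BEGIN;
-- SET LOCAL lasts only until the end of the transaction, so the
-- larger work_mem applies to just this one reporting query:
SET LOCAL work_mem = '64MB';
-- ... run the reporting query here ...
COMMIT;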

You might also want to investigate why your hard disk is so slow, or whether you can replace it with something faster. For the time being, it might help to increase shared_buffers so that Postgres can keep more data in memory and doesn't need to access the disk as often. However, that only helps from the second time you need something from the disk.
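
As a rough sketch (the right value depends on the machine; a common rule of thumb is around 25% of the available RAM):

-- shared_buffers can be set in postgresql.conf or, since 9.4, via
-- ALTER SYSTEM; either way a server restart is needed for it to take effect.
ALTER SYSTEM SET shared_buffers = '2GB';  -- assuming the box has ~8GB of RAM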