I have this really gnarly query in PostgreSQL 9.4 I need to optimize:
SELECT date_series.date,
sum(case when serial_number not like '1234%' then 1 else 0 end)::int AS v1,
sum(case when serial_number like '1234%' then 1 else 0 end)::int AS v2,
count(*)::int AS total
FROM (
SELECT generate_series::date AS date
FROM (
SELECT generate_series(date '2016-10-25 00:00', date '2017-10-24T23:59:59', interval '1 day')::date
) AS date
) AS date_series
LEFT JOIN (
SELECT site_id, timestamp as date, serial_number
FROM (
SELECT site_id, timestamp::date
FROM data_uploads
WHERE (timestamp > '2016-10-25 00:00') AND (timestamp <= '2017-10-24T23:59:59')
GROUP BY site_id, timestamp
) AS unique_sites
LEFT JOIN (
SELECT sites.id, serial_number
FROM sites
LEFT JOIN devices ON (devices.id = sites.device_id)
) AS site_devices ON (site_devices.id = unique_sites.site_id)
) AS devices_uploading ON (date_series.date = devices_uploading.date)
GROUP BY date_series.date ORDER BY date_series.date ASC
It should look at a data uploads table and get a count of each of the two different types of devices (v1 and v2) and the total devices uploading by that day. It currently takes minutes to run. The data_uploads table has only 213,692 rows.
I tried simplifying the query. I came up with one that produces the same result and, I believe, reads better, but, in a handful of tests, it's a little slower to run.
SELECT date_series.date,
sum(case when devices_uploading.serial_number not like '1234%' then 1 else 0 end)::int AS v1,
sum(case when devices_uploading.serial_number like '1234%' then 1 else 0 end)::int AS v2,
count(*)::int AS total
FROM (
SELECT generate_series(date '2016-10-25 00:00', date '2017-10-24T23:59:59', interval '1 day')::date AS date
) AS date_series
LEFT JOIN (
SELECT site_id, timestamp::date as date, serial_number FROM data_uploads
LEFT JOIN sites on data_uploads.site_id = sites.id
LEFT JOIN devices on sites.device_id = devices.id
WHERE (timestamp > '2016-10-25 00:00') AND (timestamp <= '2017-10-24T23:59:59')
GROUP BY site_id, serial_number, timestamp
) AS devices_uploading ON date_series.date = devices_uploading.date
GROUP BY date_series.date
ORDER BY date_series.date ASC
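Incidentally, PostgreSQL 9.4 introduced the aggregate FILTER clause, which can replace the sum(case ...) idiom. This is a sketch of a purely cosmetic rewrite; it should not change the plan or the timings:

```sql
-- Drop-in replacements for the three aggregate expressions above.
-- Rows where serial_number is NULL (days with no uploads) fail both
-- filters, just as the CASE version yields 0 for them.
count(*) FILTER (WHERE devices_uploading.serial_number NOT LIKE '1234%')::int AS v1,
count(*) FILTER (WHERE devices_uploading.serial_number LIKE '1234%')::int AS v2,
count(*)::int AS total
```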
What I can decipher from EXPLAIN ANALYZE is that the grouping, and the associated sorting, on the inner select is the problem:
GROUP BY site_id, timestamp
in the original query and
GROUP BY site_id, serial_number, timestamp
in the new one.
What is the best approach for optimizing this query? I've read a bit about indexes, but it sounds like they aren't a home run, as PostgreSQL sometimes doesn't use them. Also, in my refactored query, I'm not sure an index would apply, since the three grouped columns span multiple tables.
Here's a minimal schema:
CREATE TABLE data_uploads (
id serial PRIMARY KEY,
site_id text,
timestamp timestamp without time zone
);
CREATE TABLE sites (
id text PRIMARY KEY,
device_id text
);
CREATE TABLE devices (
id text PRIMARY KEY,
serial_number text
);
CREATE INDEX data_uploads_site_id_timestamp_desc_index ON data_uploads (site_id, timestamp DESC);
CREATE INDEX sites_device_id_index ON sites (device_id);
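On the index question: an index can only cover columns of a single table, and the expensive grouping here is on data_uploads columns, so a per-table expression index is the most that could apply. A sketch (the index name and column choice are assumptions, and the planner may still prefer a seq scan, since the WHERE clause keeps roughly 80% of the rows):

```sql
-- Hypothetical expression index matching the timestamp::date grouping.
-- PostgreSQL can only use it when the query filters or groups on the
-- exact same expression; verify with EXPLAIN ANALYZE.
CREATE INDEX data_uploads_day_site_index
    ON data_uploads (("timestamp"::date), site_id);
```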
Query plan for the first query:
Sort (cost=81900.73..81901.23 rows=200 width=13) (actual time=181457.530..181458.031 rows=365 loops=1)
Output: ((generate_series(('2016-10-25'::date)::timestamp with time zone, ('2017-10-24'::date)::timestamp with time zone, '1 day'::interval))::date), ((sum(CASE WHEN (devices.serial_number !~~ '1234%'::text) THEN 1 ELSE 0 END))::integer), ((sum(CASE WHEN (devices.serial_number ~~ '1234%'::text) THEN 1 ELSE 0 END))::integer), ((count(*))::integer)
Sort Key: ((generate_series(('2016-10-25'::date)::timestamp with time zone, ('2017-10-24'::date)::timestamp with time zone, '1 day'::interval))::date)
Sort Method: quicksort Memory: 42kB
Buffers: shared hit=140 read=6545 dirtied=46, temp read=1250 written=1250
-> HashAggregate (cost=81889.58..81893.08 rows=200 width=13) (actual time=181455.988..181456.905 rows=365 loops=1)
Output: ((generate_series(('2016-10-25'::date)::timestamp with time zone, ('2017-10-24'::date)::timestamp with time zone, '1 day'::interval))::date), (sum(CASE WHEN (devices.serial_number !~~ '1234%'::text) THEN 1 ELSE 0 END))::integer, (sum(CASE WHEN (devices.serial_number ~~ '1234%'::text) THEN 1 ELSE 0 END))::integer, (count(*))::integer
Group Key: ((generate_series(('2016-10-25'::date)::timestamp with time zone, ('2017-10-24'::date)::timestamp with time zone, '1 day'::interval))::date)
Buffers: shared hit=137 read=6545 dirtied=46, temp read=1250 written=1250
-> Hash Right Join (cost=31240.73..68684.26 rows=880355 width=13) (actual time=100081.085..173667.705 rows=176318 loops=1)
Output: ((generate_series(('2016-10-25'::date)::timestamp with time zone, ('2017-10-24'::date)::timestamp with time zone, '1 day'::interval))::date), devices.serial_number
Hash Cond: (((data_uploads."timestamp")::date) = ((generate_series(('2016-10-25'::date)::timestamp with time zone, ('2017-10-24'::date)::timestamp with time zone, '1 day'::interval))::date))
Buffers: shared hit=137 read=6545 dirtied=46, temp read=1250 written=1250
-> Merge Right Join (cost=31213.22..37404.14 rows=176071 width=13) (actual time=100060.338..164925.771 rows=176317 loops=1)
Output: ((data_uploads."timestamp")::date), devices.serial_number
Merge Cond: (sites.id = data_uploads.site_id)
Buffers: shared hit=137 read=6545 dirtied=46, temp read=1250 written=1250
-> Sort (cost=804.31..816.31 rows=4799 width=46) (actual time=1112.675..1312.532 rows=4873 loops=1)
Output: sites.id, devices.serial_number
Sort Key: sites.id
Sort Method: quicksort Memory: 582kB
Buffers: shared hit=37 read=241 dirtied=43
-> Hash Left Join (cost=210.91..510.88 rows=4799 width=46) (actual time=345.933..869.094 rows=4873 loops=1)
Output: sites.id, devices.serial_number
Hash Cond: (sites.device_id = devices.id)
Buffers: shared hit=37 read=241 dirtied=43
-> Seq Scan on public.sites (cost=0.00..215.99 rows=4799 width=59) (actual time=0.012..494.866 rows=4873 loops=1)
Output: sites.device_id, sites.id
Buffers: shared hit=32 read=136
-> Hash (cost=154.85..154.85 rows=4485 width=29) (actual time=345.892..345.892 rows=4485 loops=1)
Output: devices.serial_number, devices.id
Buckets: 1024 Batches: 1 Memory Usage: 267kB
Buffers: shared hit=5 read=105 dirtied=43
-> Seq Scan on public.devices (cost=0.00..154.85 rows=4485 width=29) (actual time=0.024..337.914 rows=4485 loops=1)
Output: devices.serial_number, devices.id
Buffers: shared hit=5 read=105 dirtied=43
-> Materialize (cost=30408.91..34374.95 rows=176071 width=41) (actual time=98947.653..142999.221 rows=176317 loops=1)
Output: data_uploads.site_id, ((data_uploads."timestamp")::date), data_uploads."timestamp"
Buffers: shared hit=100 read=6304 dirtied=3, temp read=1250 written=1250
-> Group (cost=30408.91..32174.06 rows=176071 width=45) (actual time=98947.641..134723.593 rows=176317 loops=1)
Output: data_uploads.site_id, (data_uploads."timestamp")::date, data_uploads."timestamp"
Group Key: data_uploads.site_id, data_uploads."timestamp"
Buffers: shared hit=100 read=6304 dirtied=3, temp read=1250 written=1250
-> Sort (cost=30408.91..30850.57 rows=176663 width=45) (actual time=98947.628..125199.772 rows=176320 loops=1)
Output: data_uploads.site_id, data_uploads."timestamp"
Sort Key: data_uploads.site_id, data_uploads."timestamp"
Sort Method: external merge Disk: 9976kB
Buffers: shared hit=100 read=6304 dirtied=3, temp read=1250 written=1250
-> Seq Scan on public.data_uploads (cost=0.00..9576.67 rows=176663 width=45) (actual time=0.038..35225.200 rows=176320 loops=1)
Output: data_uploads.site_id, data_uploads."timestamp"
Filter: ((data_uploads."timestamp" > '2016-10-25 00:00:00'::timestamp without time zone) AND (data_uploads."timestamp" <= '2017-10-24 23:59:59'::timestamp without time zone))
Rows Removed by Filter: 37412
Buffers: shared hit=97 read=6304 dirtied=3
-> Hash (cost=15.02..15.02 rows=1000 width=4) (actual time=20.731..20.731 rows=365 loops=1)
Output: ((generate_series(('2016-10-25'::date)::timestamp with time zone, ('2017-10-24'::date)::timestamp with time zone, '1 day'::interval))::date)
Buckets: 1024 Batches: 1 Memory Usage: 13kB
-> Result (cost=0.00..5.01 rows=1000 width=0) (actual time=0.018..0.697 rows=365 loops=1)
Output: (generate_series(('2016-10-25'::date)::timestamp with time zone, ('2017-10-24'::date)::timestamp with time zone, '1 day'::interval))::date
Planning time: 2.256 ms
Execution time: 181539.882 ms
Best Answer
There are several parts of the execution plan that indicate that Postgres simply does not have enough memory to efficiently process your query.
The most important one being:
Sort Method: external merge Disk: 9976kB
This means the sort of the 176320 rows could not be done in memory, so Postgres had to write and read almost 10MB of data on disk, and that step alone took over two minutes. One conclusion is that your work_mem is way too low to process this kind of query. The other conclusion is that you seem to have a terribly slow hard disk. The Materialize step a bit further up the plan also indicates that Postgres has not been allocated enough memory.
The seq scan on data_uploads:
Seq Scan on public.data_uploads (actual time=0.038..35225.200 rows=176320 loops=1)
Filter: ((data_uploads."timestamp" > '2016-10-25 00:00:00'::timestamp without time zone) AND (data_uploads."timestamp" <= '2017-10-24 23:59:59'::timestamp without time zone))
shows that the WHERE condition doesn't make use of an index. But as that condition only filters out about 20% of the rows, even a suitable index wouldn't help much. What does stand out is that the seq scan took 35 seconds on a table with only 176320 rows. On my desktop computer (regular hard drive, no SSD), a seq scan over a 1.6 million row table takes only about 4 seconds (when the table is not cached!). So again this points to a pretty slow hard disk.
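To see the raw scan speed in isolation from the joins and grouping, one diagnostic (a sketch) is to time a bare scan and look at the Buffers line:

```sql
-- The BUFFERS option shows how many blocks were read from disk vs.
-- found in shared_buffers. Run it twice to compare cold vs. cached.
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM data_uploads;
```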
You should definitely increase work_mem. Given the figures, I'd start with something like:
set work_mem = '32MB';
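The setting applies to the current session, so you can try it and immediately re-check the plan; a sketch of the workflow:

```sql
SET work_mem = '32MB';
SHOW work_mem;  -- confirms the session-level value
-- Re-run the query under EXPLAIN (ANALYZE, BUFFERS). If work_mem is
-- now large enough, the plan should show
--   Sort Method: quicksort  Memory: ...
-- instead of "Sort Method: external merge  Disk: 9976kB".
```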
If you have sufficient memory, then maybe also try 64MB. Note that the limit defined here is not a hard overall limit: a single query can request work_mem several times, once for each sort, group, or hash step it needs to run.
You might also want to investigate why your hard disk is so slow, or whether you can replace it with something faster. In the meantime, it might help to increase shared_buffers so that Postgres can keep more data in memory and doesn't need to access the disk as often. However, that only helps from the second time a given block is needed.
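Unlike work_mem, shared_buffers cannot be changed per session; on 9.4 it requires editing postgresql.conf and restarting the server. A sketch, where the actual size is an assumption you should tune to your machine:

```
# postgresql.conf -- changing shared_buffers requires a server restart
shared_buffers = 1GB    # common rule of thumb: ~25% of RAM on a dedicated DB server
```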