PostgreSQL – Less efficient query plan after merging tables

postgresql, postgresql-performance

BACKGROUND

I have an app storing sensor data from IoT devices, currently ~108 million individual datagrams spanning the last 6+ years. Various time-based queries have become slower over time, so I'm testing some changes to the schema to improve things.

For historical reasons, the datagrams have been stored across 4 different tables: 2 tables because each datagram contains two "domains", one table mapping where the datagram came from, and one maintaining a history of status changes per device. In most cases that meant 3 JOINs per query, plus additional indexes per table and duplicated foreign-key columns, both of which increased storage.

One important thing to note is that the data is NOT strictly increasing in time: newly inserted rows for some device can be older than rows already stored, and old rows are sometimes deleted and replaced by newer ones for the same day. That is the exception rather than the rule, though; in most cases new rows per device carry more up-to-date data.

TABLE CHANGES

People on pgsql-general@lists.postgresql.org suggested merging at least some of the tables again to reduce the number of JOINs, possibly using partitioning, and reviewing the indexes (e.g. using GIN). So I decided to test with ALL of the former tables merged into one, as that makes partitioning easier, makes some indexes and foreign keys unnecessary entirely, and should at least reduce the amount of storage used.

CREATE TABLE clt_rec
(
  id          bigserial                 NOT NULL,
  oms_rec     bigint                    NOT NULL,
  captured_at timestamp with time zone  NOT NULL,
  rssi        smallint                  NOT NULL
);
CREATE TABLE oms_rec
(
  id        bigserial NOT NULL,
  meter     integer   NOT NULL,
  encrypted bytea,
  decrypted bytea
);
CREATE TABLE clt_rec_src
(
  id          bigserial NOT NULL,
  real_estate integer   NOT NULL,
  collector   integer   NOT NULL,
  clt_rec     bigserial
);
CREATE TABLE meter_status_history
(
  id      serial  NOT NULL,
  oms_rec bigint  NOT NULL,
  status  smallint
);

vs.

CREATE TABLE datagram
(
  id            bigserial                 NOT NULL,
  src_re        integer                   NOT NULL,
  src_clt       integer                   NOT NULL,
  src_meter     integer                   NOT NULL,
  captured_at   timestamp with time zone  NOT NULL,
  captured_rssi smallint                  NOT NULL,
  oms_status    smallint                  NOT NULL,
  oms_enc       bytea,
  oms_dec       bytea
);

STORAGE IMPROVEMENTS

The old schema consumed ~42 GiB, while the new one is ~16 GiB without the formerly available indexes and ~25 GiB with the same logical indexes, i.e. a BTREE of ~2.5 GiB per index on the logically same columns used in the queries as before. Alternatively, I've tested BRIN indexes on the same columns as well, and their storage consumption is negligible. From that point of view the new table is an improvement.

relname                                          pg_size_pretty
------------------------------------------------ -------------- 
datagram                                         25 GB
idx_datagram_src_meter_btree                     2323 MB
idx_datagram_src_re_btree                        2323 MB
pk_datagram                                      2323 MB
idx_datagram_captured_at_btree                   2323 MB
idx_datagram_src_clt_btree                       2323 MB
idx_datagram_captured_at_brin                    512 kB
idx_datagram_src_meter_brin                      400 kB
idx_datagram_src_re_brin                         400 kB
idx_datagram_src_clt_brin                        400 kB
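
The BTREE and BRIN entries above correspond to index definitions of roughly this shape (sketch only; the actual statements may differ in options such as fillfactor or pages_per_range, which are left at their defaults here):

CREATE INDEX idx_datagram_captured_at_btree ON datagram USING btree (captured_at);
CREATE INDEX idx_datagram_captured_at_brin  ON datagram USING brin  (captured_at);
-- ...and likewise for the source columns, e.g.:
CREATE INDEX idx_datagram_src_meter_btree   ON datagram USING btree (src_meter);
CREATE INDEX idx_datagram_src_meter_brin    ON datagram USING brin  (src_meter);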

OLD QUERY PLAN

  • 2 workers used
  • BTREE-index idx_clt_rec_captured_at used with 3 iterations
Unique  (cost=1669026.62..1669182.57 rows=10397 width=51) (actual time=1295.902..1406.325 rows=22681 loops=1)
  ->  Sort  (cost=1669026.62..1669052.61 rows=10397 width=51) (actual time=1295.901..1388.117 rows=104607 loops=1)
        Sort Key: real_estate.number, flat.number, meter.mfct_code, meter.reading_serial, meter.type, clt_rec.captured_at DESC
        Sort Method: external merge  Disk: 5952kB
        ->  Gather  (cost=265793.42..1668332.93 rows=10397 width=51) (actual time=119.077..681.224 rows=104607 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Hash Join  (cost=264793.42..1666293.23 rows=4332 width=51) (actual time=96.080..638.802 rows=34869 loops=3)
                    Hash Cond: (oms_rec.meter = meter.id)
                    ->  Nested Loop  (cost=1.14..1400747.39 rows=189399 width=20) (actual time=0.145..496.366 rows=171456 loops=3)
                          ->  Parallel Index Scan using idx_clt_rec_captured_at on clt_rec  (cost=0.57..14853.95 rows=189399 width=24) (actual time=0.098..81.556 rows=171456 loops=3)
                                Index Cond: ((captured_at >= ('2020-08-28 10:34:32.855+02'::timestamp with time zone - '5 days'::interval)) AND (captured_at <= ('2020-08-28 10:34:32.855+02'::timestamp with time zone + '00:00:00'::interval)))
                          ->  Index Scan using pk_oms_rec on oms_rec  (cost=0.57..7.32 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=514369)
                                Index Cond: (id = clt_rec.oms_rec)
                    ->  Hash  (cost=264709.53..264709.53 rows=6620 width=39) (actual time=95.521..95.528 rows=40044 loops=3)
                          Buckets: 65536 (originally 8192)  Batches: 1 (originally 1)  Memory Usage: 3016kB
                          ->  Hash Join  (cost=145.59..264709.53 rows=6620 width=39) (actual time=9.883..86.390 rows=40044 loops=3)
                                Hash Cond: (meter.meter_bcd = meter_bcd.id)
                                ->  Index Scan using pk_meter on meter  (cost=0.42..264348.00 rows=40044 width=25) (actual time=9.034..74.556 rows=40044 loops=3)
                                      Index Cond: (id = ANY ('{[...]}'::integer[]))
                                ->  Hash  (cost=143.38..143.38 rows=143 width=22) (actual time=0.827..0.833 rows=113 loops=3)
                                      Buckets: 1024  Batches: 1  Memory Usage: 13kB
                                      ->  Hash Join  (cost=125.05..143.38 rows=143 width=22) (actual time=0.710..0.814 rows=113 loops=3)
                                            Hash Cond: (meter_bcd.flat = flat.id)
                                            ->  Seq Scan on meter_bcd  (cost=0.00..13.65 rows=865 width=8) (actual time=0.040..0.097 rows=865 loops=3)
                                            ->  Hash  (cost=123.60..123.60 rows=116 width=22) (actual time=0.626..0.631 rows=113 loops=3)
                                                  Buckets: 1024  Batches: 1  Memory Usage: 13kB
                                                  ->  Hash Join  (cost=110.72..123.60 rows=116 width=22) (actual time=0.538..0.611 rows=113 loops=3)
                                                        Hash Cond: (flat.real_estate = real_estate.id)
                                                        ->  Seq Scan on flat  (cost=0.00..11.03 rows=703 width=9) (actual time=0.030..0.073 rows=703 loops=3)
                                                        ->  Hash  (cost=109.30..109.30 rows=113 width=21) (actual time=0.466..0.467 rows=113 loops=3)
                                                              Buckets: 1024  Batches: 1  Memory Usage: 13kB
                                                              ->  Seq Scan on real_estate  (cost=0.00..109.30 rows=113 width=21) (actual time=0.198..0.447 rows=113 loops=3)
                                                                    Filter: (id = ANY ('{[...]}'::integer[]))
                                                                    Rows Removed by Filter: 570
Planning Time: 25.434 ms
Execution Time: 1409.988 ms

NEW QUERY PLAN BTREE

  • NO worker planned/used
  • idx_datagram_captured_at_btree used with one iteration
Unique  (cost=2005167.50..2005337.35 rows=11323 width=51) (actual time=6410.799..6522.333 rows=22681 loops=1)
  ->  Sort  (cost=2005167.50..2005195.81 rows=11323 width=51) (actual time=6410.798..6504.023 rows=104607 loops=1)
        Sort Key: real_estate.number, flat.number, meter.mfct_code, meter.reading_serial, meter.type, datagram.captured_at DESC
        Sort Method: external merge  Disk: 5952kB
        ->  Hash Join  (cost=246164.35..2004405.07 rows=11323 width=51) (actual time=93.802..5776.755 rows=104607 loops=1)
              Hash Cond: (meter.meter_bcd = meter_bcd.id)
              ->  Hash Join  (cost=246019.19..2003889.83 rows=68494 width=37) (actual time=93.067..5744.787 rows=104607 loops=1)
                    Hash Cond: (datagram.src_meter = meter.id)
                    ->  Index Scan using idx_datagram_captured_at_btree on datagram  (cost=0.57..1756571.73 rows=495033 width=20) (actual time=0.054..5451.417 rows=514369 loops=1)
                          Index Cond: ((captured_at >= ('2020-08-28 10:34:32.855+02'::timestamp with time zone - '5 days'::interval)) AND (captured_at <= ('2020-08-28 10:34:32.855+02'::timestamp with time zone + '00:00:00'::interval)))
                    ->  Hash  (cost=245518.07..245518.07 rows=40044 width=25) (actual time=92.791..92.792 rows=40044 loops=1)
                          Buckets: 65536  Batches: 1  Memory Usage: 2859kB
                          ->  Index Scan using pk_meter on meter  (cost=0.42..245518.07 rows=40044 width=25) (actual time=8.506..83.849 rows=40044 loops=1)
                                Index Cond: (id = ANY ('{[...]}'::integer[]))
              ->  Hash  (cost=143.38..143.38 rows=143 width=22) (actual time=0.730..0.733 rows=113 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 13kB
                    ->  Hash Join  (cost=125.05..143.38 rows=143 width=22) (actual time=0.627..0.716 rows=113 loops=1)
                          Hash Cond: (meter_bcd.flat = flat.id)
                          ->  Seq Scan on meter_bcd  (cost=0.00..13.65 rows=865 width=8) (actual time=0.013..0.064 rows=865 loops=1)
                          ->  Hash  (cost=123.60..123.60 rows=116 width=22) (actual time=0.573..0.575 rows=113 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 13kB
                                ->  Hash Join  (cost=110.72..123.60 rows=116 width=22) (actual time=0.492..0.558 rows=113 loops=1)
                                      Hash Cond: (flat.real_estate = real_estate.id)
                                      ->  Seq Scan on flat  (cost=0.00..11.03 rows=703 width=9) (actual time=0.006..0.047 rows=703 loops=1)
                                      ->  Hash  (cost=109.30..109.30 rows=113 width=21) (actual time=0.445..0.446 rows=113 loops=1)
                                            Buckets: 1024  Batches: 1  Memory Usage: 13kB
                                            ->  Seq Scan on real_estate  (cost=0.00..109.30 rows=113 width=21) (actual time=0.180..0.429 rows=113 loops=1)
                                                  Filter: (id = ANY ('{[...]}'::integer[]))
                                                  Rows Removed by Filter: 570
Planning Time: 25.226 ms
Execution Time: 6525.808 ms

NEW QUERY PLAN BRIN

  • 2 workers used
  • parallel seq scan instead of BRIN-index
Unique  (cost=2957786.28..2957956.13 rows=11323 width=51) (actual time=17276.310..17388.140 rows=22681 loops=1)
  ->  Sort  (cost=2957786.28..2957814.59 rows=11323 width=51) (actual time=17276.309..17369.897 rows=104607 loops=1)
        Sort Key: real_estate.number, flat.number, meter.mfct_code, meter.reading_serial, meter.type, datagram.captured_at DESC
        Sort Method: external merge  Disk: 5944kB
        ->  Gather  (cost=246962.35..2957023.85 rows=11323 width=51) (actual time=510.149..16634.476 rows=104607 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Hash Join  (cost=245962.35..2954891.55 rows=4718 width=51) (actual time=414.249..16845.834 rows=34869 loops=3)
                    Hash Cond: (datagram.src_meter = meter.id)
                    ->  Parallel Seq Scan on datagram  (cost=0.00..2708108.53 rows=206264 width=20) (actual time=0.653..16348.221 rows=171456 loops=3)
                          Filter: ((captured_at >= ('2020-08-28 10:34:32.855+02'::timestamp with time zone - '5 days'::interval)) AND (captured_at <= ('2020-08-28 10:34:32.855+02'::timestamp with time zone + '00:00:00'::interval)))
                          Rows Removed by Filter: 35983845
                    ->  Hash  (cost=245879.60..245879.60 rows=6620 width=39) (actual time=413.062..413.067 rows=40044 loops=3)
                          Buckets: 65536 (originally 8192)  Batches: 1 (originally 1)  Memory Usage: 3016kB
                          ->  Hash Join  (cost=145.59..245879.60 rows=6620 width=39) (actual time=9.850..400.265 rows=40044 loops=3)
                                Hash Cond: (meter.meter_bcd = meter_bcd.id)
                                ->  Index Scan using pk_meter on meter  (cost=0.42..245518.07 rows=40044 width=25) (actual time=8.971..384.016 rows=40044 loops=3)
                                      Index Cond: (id = ANY ('{[...]}'::integer[]))
                                ->  Hash  (cost=143.38..143.38 rows=143 width=22) (actual time=0.869..0.873 rows=113 loops=3)
                                      Buckets: 1024  Batches: 1  Memory Usage: 13kB
                                      ->  Hash Join  (cost=125.05..143.38 rows=143 width=22) (actual time=0.743..0.855 rows=113 loops=3)
                                            Hash Cond: (meter_bcd.flat = flat.id)
                                            ->  Seq Scan on meter_bcd  (cost=0.00..13.65 rows=865 width=8) (actual time=0.039..0.105 rows=865 loops=3)
                                            ->  Hash  (cost=123.60..123.60 rows=116 width=22) (actual time=0.659..0.661 rows=113 loops=3)
                                                  Buckets: 1024  Batches: 1  Memory Usage: 13kB
                                                  ->  Hash Join  (cost=110.72..123.60 rows=116 width=22) (actual time=0.567..0.643 rows=113 loops=3)
                                                        Hash Cond: (flat.real_estate = real_estate.id)
                                                        ->  Seq Scan on flat  (cost=0.00..11.03 rows=703 width=9) (actual time=0.026..0.077 rows=703 loops=3)
                                                        ->  Hash  (cost=109.30..109.30 rows=113 width=21) (actual time=0.490..0.491 rows=113 loops=3)
                                                              Buckets: 1024  Batches: 1  Memory Usage: 13kB
                                                              ->  Seq Scan on real_estate  (cost=0.00..109.30 rows=113 width=21) (actual time=0.209..0.469 rows=113 loops=3)
                                                                    Filter: (id = ANY ('{[...]}'::integer[]))
                                                                    Rows Removed by Filter: 570
Planning Time: 30.375 ms
Execution Time: 17391.867 ms

WHAT'S GOING ON?

I'm running the same query against both schemas with the same data, and I had expected the new table using the BTREE index to perform at least as well (or as badly) as the old schema. Instead, things are slower, and I can't see why, e.g., no parallel workers are used at all. Additionally, the BRIN index seems to be ignored entirely, even though other people successfully use it for time-based data with roughly ordered IDs and the like.
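
For reference, the shape of the query as visible in the plans is roughly the following (new-schema variant; SELECT list reduced to the sort keys, ID lists elided exactly as in the plans, so this is a sketch rather than the literal statement):

SELECT DISTINCT
       real_estate.number, flat.number,
       meter.mfct_code, meter.reading_serial, meter.type,
       datagram.captured_at
FROM datagram
JOIN meter       ON meter.id       = datagram.src_meter
JOIN meter_bcd   ON meter_bcd.id   = meter.meter_bcd
JOIN flat        ON flat.id        = meter_bcd.flat
JOIN real_estate ON real_estate.id = flat.real_estate
WHERE datagram.captured_at BETWEEN '2020-08-28 10:34:32.855+02'::timestamptz - INTERVAL '5 days'
                               AND '2020-08-28 10:34:32.855+02'::timestamptz
  AND meter.id       = ANY ('{[...]}'::integer[])  -- ID list elided, as in the plans
  AND real_estate.id = ANY ('{[...]}'::integer[])  -- ID list elided, as in the plans
ORDER BY real_estate.number, flat.number,
         meter.mfct_code, meter.reading_serial, meter.type,
         datagram.captured_at DESC;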

Ideally I'd like the new schema to match the old schema's BTREE performance, and at best get somewhat similar performance using only the BRIN index, to consume less space.

Any idea what I'm doing wrong, and where? Thanks!

Best Answer

This looks like a time series database, with one huge table "datagram" that contains measurements with date "captured_at", source sensor "meter", value, and other info.

In this scenario, a common query is "where sensors in (list of sensors) and date between start_date and end_date" or "where sensors in (list of sensors) order by date desc limit something".
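
In SQL terms (column names taken from the question's new schema, values purely illustrative) that would be something like:

-- time window for a set of sensors
SELECT * FROM datagram
WHERE src_meter = ANY ('{101,102,103}'::integer[])
  AND captured_at BETWEEN '2020-08-01' AND '2020-08-28';

-- latest readings for a set of sensors
SELECT * FROM datagram
WHERE src_meter = ANY ('{101,102,103}'::integer[])
ORDER BY captured_at DESC
LIMIT 100;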

So the first thing is data locality. If rows are inserted into the table as they come, the order of rows on disk will probably be quite close to "order by date". This means each page contains rows from lots of different meters, so if you want to filter by a specific meter_id, whether an index scan or a bitmap scan is used, Postgres will have to read tons of pages just to get a few rows out of each, which will slaughter your IO and your buffer cache. Please re-run the query with EXPLAIN (ANALYZE, BUFFERS) and you'll probably see a very high number of IOs.
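
That's just a matter of adding one option to the EXPLAIN you already ran:

EXPLAIN (ANALYZE, BUFFERS)
SELECT ...;   -- your existing query here, unchanged

-- optional, adds I/O timings to the BUFFERS output (may require superuser):
-- SET track_io_timing = on;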

It would be much more practical if the table were ordered by (meter_id, date): when Postgres wants rows for one meter_id and grabs a page, lots of rows with the same meter_id are already there, so it doesn't need any extra IO. This will also supercharge your bitmap index scans, since far fewer bits will be set to "1" in the bitmap when the rows you need are concentrated in a few pages rather than thinly spread all over the table.

So you are going to have to do something like CREATE TABLE ... AS SELECT * FROM oldtable ORDER BY meter_id, date, or redo your import with an ORDER BY clause added. While you do this, set work_mem (or maintenance_work_mem, I don't remember which one is used by this command) to a pretty high value, like 25% of your available RAM minus shared_buffers, to make sure the sort has room to work in. This is going to take a while, and it would be better to have the temporary files on an SSD.
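
A minimal sketch of that rewrite, using the new schema's actual column names in place of the generic meter_id/date (the sort of a plain CREATE TABLE ... AS should fall under work_mem, but raising both settings costs nothing here):

SET work_mem = '2GB';               -- example value; ~25% of RAM minus shared_buffers, as above
SET maintenance_work_mem = '2GB';   -- the index rebuilds afterwards benefit from this one

CREATE TABLE datagram_ordered AS
SELECT * FROM datagram
ORDER BY src_meter, captured_at;

-- then swap the tables, e.g.:
-- ALTER TABLE datagram RENAME TO datagram_unordered;
-- ALTER TABLE datagram_ordered RENAME TO datagram;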

Then rebuild the indexes and check your query again; compare the IO numbers from EXPLAIN (ANALYZE, BUFFERS) and you should see a much lower number of heap reads.

Now the other thing is that you really want an index on (meter_id, date) or (date, meter_id), so it can satisfy both the WHERE and the JOIN conditions. Currently the query only uses the index on the date and has to check every matching row for meter_id, which is slow. If you order your table by (meter_id, date), then an index on (meter_id, date) is the best choice. Your JOIN should then use this index, and it should be a lot faster.
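
A sketch of the two candidates, again with the new schema's column names (you only need one of them, matching the order you chose for the table):

CREATE INDEX idx_datagram_src_meter_captured_at
    ON datagram USING btree (src_meter, captured_at);

-- or, if most queries filter on the time window first:
CREATE INDEX idx_datagram_captured_at_src_meter
    ON datagram USING btree (captured_at, src_meter);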

However, new data will be inserted in date order again, and you really don't want to reorder the whole table too often. So you should partition the table by date, have all old partitions ordered and pretty much read-only, and aim to keep the last partition in cache constantly. When the most recent partition grows larger than about half your shared_buffers, you order its contents with a CLUSTER command, shelve it, and create a new partition for new incoming rows. There are other ways to do this, with less locking, but you get the idea.
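
A minimal sketch of what declarative range partitioning by date could look like here (monthly ranges and names are just examples; the CLUSTER at the end is the "order its contents" step for a partition that no longer receives rows):

-- partitioned parent with the same columns as datagram
CREATE TABLE datagram_part (LIKE datagram) PARTITION BY RANGE (captured_at);

CREATE TABLE datagram_2020_08 PARTITION OF datagram_part
  FOR VALUES FROM ('2020-08-01') TO ('2020-09-01');
CREATE TABLE datagram_2020_09 PARTITION OF datagram_part
  FOR VALUES FROM ('2020-09-01') TO ('2020-10-01');

-- once a partition is "closed", order it physically by the composite index
CREATE INDEX ON datagram_2020_08 (src_meter, captured_at);
CLUSTER datagram_2020_08 USING datagram_2020_08_src_meter_captured_at_idx;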

Speaking of sorting, your EXPLAIN says:

Sort Method: external merge  Disk: 5952kB

I guess work_mem must be set to a really low value if it falls back to a slow on-disk merge sort for such a tiny sort. You can set work_mem a bit higher so it uses an in-memory quicksort instead; that should speed up the sort.
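
For this particular query, a per-session setting along these lines should be enough to keep the ~6 MB sort in memory (the exact value is a guess; it just has to exceed the sort's in-memory footprint, and it applies per sort node, per backend):

SET work_mem = '32MB';
-- or persist it for the application role (role name is hypothetical):
-- ALTER ROLE app_user SET work_mem = '32MB';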