PostgreSQL – Slow update performance

postgresql, postgresql-performance, update

The server runs PostgreSQL 10 (10.5 (Ubuntu 10.5-1.pgdg16.04+1)) and has 256 GB of RAM.

shared_buffers = 64GB
effective_cache_size = 192GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 500
effective_io_concurrency = 200
work_mem = 111MB
min_wal_size = 4GB
max_wal_size = 16GB
max_worker_processes = 8

There is a table with about 133 million rows. It occupies 52 GB on disk; the real data size is 26 GB.
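
The sizes can be checked with the standard size functions, for example (just a quick sketch):

SELECT pg_size_pretty(pg_total_relation_size('tmd.t_metadata')) AS total_size,   -- heap + indexes + TOAST
       pg_size_pretty(pg_relation_size('tmd.t_metadata'))       AS heap_size,    -- main table file only
       pg_size_pretty(pg_indexes_size('tmd.t_metadata'))        AS indexes_size; -- all indexes together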

DDL of the table:

CREATE TABLE tmd.t_metadata (
    id bigserial NOT NULL,
    hash uuid NOT NULL,
    parent_hash uuid NULL,
    dataset_id int2 NOT NULL,
    status_id int2 NULL,
    create_ts timestamp NOT NULL,
    update_ts timestamp NOT NULL,
    source_path varchar(1024) NULL,
    target_path varchar(1024) NULL,
    info jsonb NULL,
    file_modified timestamptz NULL,
    CONSTRAINT t_metadata_pk PRIMARY KEY (id),
    CONSTRAINT t_metadata_ux UNIQUE (hash)
);
CREATE INDEX metadata_file_modified_ix ON tmd.t_metadata USING btree (file_modified);

CREATE INDEX metadata_search_ix ON tmd.t_metadata USING btree (dataset_id) WHERE ((parent_hash IS NULL) AND (status_id = ANY (ARRAY[0, 1])));

CREATE INDEX metadata_search_old_ix ON tmd.t_metadata USING btree (dataset_id) WHERE ((parent_hash IS NULL) AND (file_modified IS NULL));

CREATE INDEX t_metadata_file_modified_filter_ix ON tmd.t_metadata USING btree (dataset_id, file_modified) WHERE ((status_id = ANY (ARRAY[0, 1])) AND (parent_hash IS NOT NULL));
CREATE INDEX t_metadata_file_modified_ix ON tmd.t_metadata USING btree (file_modified) WHERE (parent_hash IS NULL);
CREATE UNIQUE INDEX t_metadata_hash_uindex ON tmd.t_metadata USING btree (hash);

And there is this query:

UPDATE tmd.t_metadata 
  SET status_id=2, 
      update_ts='2020-02-07T13:45:46.519667'::timestamp 
WHERE tmd.t_metadata.dataset_id = '20' 
  AND tmd.t_metadata.status_id IN (0) 
  AND tmd.t_metadata.file_modified > '2017-01-01' 
  AND tmd.t_metadata.file_modified < '2019-01-11' 
  AND tmd.t_metadata.dataset_id = '20' 
  AND tmd.t_metadata.parent_hash IS NOT null;

The query uses the index t_metadata_file_modified_filter_ix:

QUERY PLAN
----------------------------------------------------------------------------------------------------
Update on t_metadata  (cost=747718.24..4804650.70 rows=25992028 width=293)
  ->  Bitmap Heap Scan on t_metadata  (cost=747718.24..4804650.70 rows=25992028 width=293)
        Recheck Cond: ((dataset_id = '20'::smallint) AND (file_modified > '2017-01-01 00:00:00+03'::timestamp with time zone) AND (file_modified < '2019-01-11 00:00:00+03'::timestamp with time zone) AND (status_id = ANY ('{0,1}'::integer[])) AND (parent_hash IS NOT NULL))
        Filter: (status_id = 0)
        ->  Bitmap Index Scan on t_metadata_file_modified_filter_ix  (cost=0.00..741220.23 rows=26007973 width=0)
              Index Cond: ((dataset_id = '20'::smallint) AND (file_modified > '2017-01-01 00:00:00+03'::timestamp with time zone) AND (file_modified < '2019-01-11 00:00:00+03'::timestamp with time zone))

It updates about 100k rows.
When I tested it, it ran in about 30 seconds.
Now I see it has been running for more than 2 hours.
It shows that it is waiting on IO, on DataFileRead.
The same has happened many times in the past. I logged the execution plan to the PostgreSQL log; it is the same as during my tests.

Execution plan from PostgreSQL log:

2020-02-07 17:28:04.839 [24465] dataware@dev_zakupki-44 LOG:  duration: 2538066.932 ms  plan:
        Query Text: UPDATE tmd.t_metadata SET status_id=2, update_ts='2020-02-07T13:45:46.519667'::timestamp WHERE tmd.t_metadata.dataset_id = '20' AND tmd.t_metadata.status_id IN (0) AND tmd.t_metadata.file_modified > '2017-01-01' AND tmd.t_metadata.file_modified < '2019-01-11' AND tmd.t_metadata.dataset_id = '20' AND tmd.t_metadata.parent_hash IS NOT NULL
        Update on tmd.t_metadata  (cost=747718.24..4804650.70 rows=25992028 width=293) (actual time=2538066.916..2538066.916 rows=0 loops=1)
          Buffers: shared hit=2012810 read=206960 dirtied=201143
          ->  Bitmap Heap Scan on tmd.t_metadata  (cost=747718.24..4804650.70 rows=25992028 width=293) (actual time=12.003..300.566 rows=103148 loops=1)
                Output: id, hash, parent_hash, dataset_id, '2'::smallint, create_ts, '2020-02-07 13:45:46.519667'::timestamp without time zone, source_path, target_path, info, file_modified, ctid
                Recheck Cond: ((t_metadata.dataset_id = '20'::smallint) AND (t_metadata.file_modified > '2017-01-01 00:00:00+03'::timestamp with time zone) AND (t_metadata.file_modified < '2019-01-11 00:00:00+03'::timestamp with time zone) AND (t_metadata.status_id = ANY ('{0,1}'::integer[])) AND (t_metadata.parent_hash IS NOT NULL))
                Filter: (t_metadata.status_id = 0)
                Heap Blocks: exact=2041
                Buffers: shared hit=2443
                ->  Bitmap Index Scan on t_metadata_file_modified_filter_ix  (cost=0.00..741220.23 rows=26007973 width=0) (actual time=11.623..11.623 rows=103148 loops=1)
                      Index Cond: ((t_metadata.dataset_id = '20'::smallint) AND (t_metadata.file_modified > '2017-01-01 00:00:00+03'::timestamp with time zone) AND (t_metadata.file_modified < '2019-01-11 00:00:00+03'::timestamp with time zone))
                      Buffers: shared hit=402

Something similar happens with some other inserts: they insert 300k-500k rows in a single statement and run for hours. But when I test them on the same server with the same data, they usually run fast.
When the applications on the server start up and begin loading data into the database, performance drops many times over. This is a dev server, so practically anything can be changed in the configuration for testing, but the same problem happens on the prod servers.

Practically all of the waits are in the kernel, in IO write operations.

I do not quite understand why PostgreSQL writes something to disk if the transaction is not committed yet. I thought the data for uncommitted transactions should be written by the background writer?

It looks to me like the backend processes are waiting on writes. Normally they should not do that? Perhaps these are waits on flushes due to vm.dirty_ratio? vm.dirty_ratio is 20 and vm.dirty_background_ratio is 5. Should we change something here?

How can the performance of these queries be improved?


I created an account, but the initial post was written without one, and I do not know how to link them.

About testing: yes, I tested the query on the same server.
What was done: while the long update was running, I copied the contents of the table to a separate table using SELECT INTO. Then I applied the same update to that table. It ran fast, in about 30 seconds.
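
Roughly, the test looked like this (a simplified sketch, not the exact statements; t_metadata_copy is just a throwaway name):

-- copy the current contents into a separate table (no indexes, no constraints)
SELECT * INTO tmd.t_metadata_copy FROM tmd.t_metadata;

-- run the same update against the copy and time it
UPDATE tmd.t_metadata_copy
   SET status_id = 2,
       update_ts = '2020-02-07T13:45:46.519667'::timestamp
 WHERE dataset_id = '20'
   AND status_id IN (0)
   AND file_modified > '2017-01-01'
   AND file_modified < '2019-01-11'
   AND parent_hash IS NOT NULL;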

As for the question about adding status_id to the index: it is already there, as a partial index with a WHERE condition. As one can see, the SELECT part works fast; the slow part is the insert.

So, looking at the answers, I see the following items to try:

  1. Test what happens if the fillfactor of the indexes is set to something like 50%, reindex, and see the result (a sketch follows this list). Could that improve the situation? And how, given that tables in PostgreSQL are heaps and fillfactor should not affect heaps, only indexes?
  2. Try to split the updates into smaller chunks (also sketched after this list). How can this improve speed if the table would need to be vacuumed between chunks?
  3. For big inserts (400k-500k rows into a table with about 100 million rows), instead of INSERT .. SELECT .. WHERE NOT EXISTS (..), use another pattern: store the result of the SELECT in a temporary or unlogged table, then copy it into the main table in one pass (see the sketch after this list). Is this a widely used pattern?
  4. I would test UNLOGGED to see how it affects performance, but I cannot make the table unlogged in production, only in a dev environment.
  5. I cannot drop all the indexes on the target table during inserts, because there are queries against the data in the table.
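
For item 1, what I have in mind is roughly this (50% is just an example value; the new fillfactor only applies to pages built after the change, hence the REINDEX):

ALTER INDEX tmd.t_metadata_file_modified_filter_ix SET (fillfactor = 50);
REINDEX INDEX tmd.t_metadata_file_modified_filter_ix;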
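
For item 2, the chunking would look roughly like this (a sketch; the chunk size of 10000 is arbitrary). The statement is repeated until it reports 0 rows updated, and vacuum can run between the chunks:

UPDATE tmd.t_metadata
   SET status_id = 2,
       update_ts = '2020-02-07T13:45:46.519667'::timestamp
 WHERE id IN (SELECT id
                FROM tmd.t_metadata
               WHERE dataset_id = '20'
                 AND status_id IN (0)
                 AND file_modified > '2017-01-01'
                 AND file_modified < '2019-01-11'
                 AND parent_hash IS NOT NULL
               LIMIT 10000);  -- already-updated rows no longer match status_id IN (0), so they are not picked again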
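
And for item 3, the pattern I mean looks roughly like this (a sketch; source_rows stands for whatever produces the 400k-500k new rows):

-- stage the candidate rows first; a temp table generates no WAL for this step
CREATE TEMP TABLE t_metadata_stage AS
SELECT s.*
  FROM source_rows s
 WHERE NOT EXISTS (SELECT 1
                     FROM tmd.t_metadata m
                    WHERE m.hash = s.hash);

-- then load the staged rows into the main table in a single pass
INSERT INTO tmd.t_metadata
SELECT * FROM t_metadata_stage;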

About the IO waits:
Yes, the first observation comes from pg_stat_activity, the second from the perf utility.

About shared buffers: the server has 256 GB of memory, only PostgreSQL runs on it, and shared_buffers is set to 64 GB, so at least all of the indexes should fit into memory.
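
Whether those indexes actually stay resident in shared_buffers can be checked with the pg_buffercache extension, for example (this is essentially the example query from the documentation):

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- how many 8 kB shared buffers each relation currently occupies
SELECT c.relname, count(*) AS buffers
  FROM pg_buffercache b
  JOIN pg_class c
    ON b.relfilenode = pg_relation_filenode(c.oid)
   AND b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database()))
 GROUP BY c.relname
 ORDER BY 2 DESC
 LIMIT 10;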

The IO system is quite slow: several 10k rpm disks in RAID 10, about 2000 IOPS random read and 700 IOPS random write. I am considering proposing an upgrade to SSD drives, but it is necessary to understand the expected outcome. Will the queries become twice as fast, or 10+ times faster? I don't know. The total size of the DB is about 1.2 TB.

I guess the problem of mass loading and mass updating of data is quite common for PostgreSQL. Are there any papers about it?

And thanks to all who answered for your time.

Best Answer

It shows that it is waiting on IO, on DataFileRead.

Practically all of the waits are in the kernel, in IO write operations.

These seem contradictory. I assume the first conclusion is coming from pg_stat_activity. Where is the second one coming from?

I do not quite understand why PostgreSQL writes something to disk if the transaction is not committed yet.

There is only so much room for dirty buffers. So you have to write stuff out eventually to make room to read different stuff in. In general, transaction management and buffer management are only loosely coupled. Or at least, coupled in ways other than what your mental model says. The information that lets you decide whether each tuple was committed is part of the buffer/page itself; it gets written out along with it and read back in with it. It is not memory-resident-only. And at commit time, you only need to write out and flush the WAL you generated, not all the other data you dirtied.
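
For illustration, that per-tuple visibility information is exposed through the hidden system columns, e.g.:

-- xmin/xmax are the inserting/deleting transaction ids stored in each tuple header;
-- they are written to disk and read back as part of the page
SELECT ctid, xmin, xmax, id, status_id
  FROM tmd.t_metadata
 LIMIT 5;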

Increasing shared_buffers can sometimes help this, making more room for dirty buffers to accumulate before it is forced to write them. If all of your indexes can fit in shared_buffers, this can really help because then you can dirty the same index pages over and over, but with gaps in time between the dirtying, without them being written in between. If you can't make shared_buffers that large, then the same amount of data still needs to get written eventually, so you are likely just moving the pain around, rather than fixing it. You might be able to get improvements there by changing the *_flush_after parameters. I've never had much luck with them, but other people have.
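
For example (values purely illustrative, not recommendations), those parameters can be changed without a restart:

-- ask the kernel to start writing back dirty data after this much has been written
-- by each process, instead of letting it pile up and stall later
ALTER SYSTEM SET backend_flush_after = '256kB';
ALTER SYSTEM SET bgwriter_flush_after = '512kB';
ALTER SYSTEM SET checkpoint_flush_after = '256kB';
SELECT pg_reload_conf();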

I thought the data for uncommitted transactions should be written by the background writer?

If the background writer can't keep up, then the ordinary backends do their own writes. In general, I find the background writer pretty useless. If writes are being freely absorbed by the kernel, then they are so fast there is no reason to offload that task to the background writer. And if writes are blocking in the kernel, then the background writer won't be able to keep up and so the backends have to do it themselves anyway.
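
You can see how the writes are split between the checkpointer, the background writer, and the backends themselves:

SELECT buffers_checkpoint,   -- written by the checkpointer
       buffers_clean,        -- written by the background writer
       buffers_backend,      -- written by ordinary backends themselves
       maxwritten_clean      -- times the bgwriter stopped because it hit its per-round limit
  FROM pg_stat_bgwriter;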

Perhaps these are waits on flushes due to vm.dirty_ratio? vm.dirty_ratio is 20 and vm.dirty_background_ratio is 5. Should we change something here?

All your dirty data needs to get written to disk eventually. Tweaking these settings, you might be able to get your updates to go faster, only for your checkpoints to seize up for extended periods once the checkpointer starts issuing sync requests.

Maybe you just need storage with faster writes. What is your IO system currently like?