AWS Aurora PostgreSQL – Handling Mass Updates on 100 Million Rows

aws-auroraperformancepostgresql

During the migration of a 300 million rows table I have encountered a strange behaviour on our AWS Aurora database:

Around 25 minutes into the migration, the total IOPS of the database dropped to almost 0 and it seemed to be idling without waits and did not return – seemingly stuck in the operation.

How can this behaviour be explained? Is it related to Aurora and the read replicas? I do expect that the migration can run on Aurora as it did on another PostgreSQL RDS instance.

Quick facts:

AWS Aurora PostgreSQL database, 3 replicas (same instance/class/size), 2 reader, multi AZ.
Huge table: Migration affects ~100 million rows of ~330 million total rows.
Migration was split into 3 chunks of 100 million rows.
Dead tuples where cleaned up before migration with manual vacuum.
Migration ran successfully on dev environment (db copy on PostgreSQL RDS instance) in about 70 minutes

Chunk 1 query in question:

-- Expected affected rows: 15154380
UPDATE debtclaim_event
SET acceptance_mode = 'NORMAL',
    modifiedon      = now(),
    modifiedby      = 'system#FC-7868',
    optlock         = optlock + 1
WHERE type = 'ACCEPTED'
  AND id < 100000000;

related explain / analyse:

EXPLAIN ANALYSE SELECT * FROM debtclaim_event WHERE type = 'ACCEPTED' AND acceptance_mode IS NULL;

resulted in:

Seq Scan on debtclaim_event  (cost=0.00..10236490.40 rows=33147254 width=550) (actual time=7.737..141019.913 rows=33358994 loops=1)
  Filter: ((acceptance_mode IS NULL) AND (type = 'ACCEPTED'::text))
  Rows Removed by Filter: 265944598
Planning Time: 0.096 ms
Execution Time: 142877.726 ms

Further steps

I tried to reduce the chunk size to 10 million. This resulted in the first chunk passing (which had no updates) and the second chunk passing with ~1,4 milltion rows altered in about 3 minutes.

The third chunk showed a similar behavoiur, as it dropped total iops after about 6 minutes and then idling for 15 more minutes, not returning.

Besides that I did not try to optimize the migration further. A possible way forward could be a new table creation, instead of updating the original one, as described in another post.

More insights from AWS RDS

During the operation other sessions tried to access and alter the table in question, but those mostly waited:

Here is another graph depicting the operation in further detail:

Best Answer

While it was not quite clear what exactly caused this slowdown AWS support engineers hinted that it most likely was caused by some kind of limited bandwidth due to instance size (making @LaurenzAlbe 's call pretty accurate). The proposed solution was scaling the writer instance or the whole cluster to be sure.

Vertically scaling the cluster instances to .8xlarge allowed for more bandwidth and more bandwidth burst? credits. This enabled us to run a similar migration with hundreds of million of rows in under 1 hour.

Additionally the migration was optimized by making it a create table migration (as seen in another post: Optimizing bulk update performance in PostgreSQL) instead of updating in place, resulting in less DB operations and less dead tuples.

Related Solutions

PostgreSQL 9.4 – Query Performance Issues

Mostly, this seems to be a misunderstanding. According to your query plan, you are retrieving rows=1410288 and the query itself is not that slow. It does not "take 224.9 seconds":

Execution time: 697.753 ms

You could probably improve the performance of your query some more by increasing the locality of clustered data, i.e., CLUSTER (or pg_repack) your table based on your index domain_id_btree (that appears under the different name of test_index_9 in your query plan). Details here:

Configuring PostgreSQL for read performance

If you do that (arriving at a favorable physical order of rows) and if you mostly just INSERT rows (UPDATE / DELETE is rare) and since there seem to be many rows per domain_id you might also benefit from a BRIN index (new in Postgres 9.5).

(It's remarkable that your current index is used at all, given that you retrieve ~ 6% of all rows. This indicates that your data is mostly clustered on domain_id already - or cost settings / statistics in your DB are inaccurate.)

But that's only going to reduce the 700 ms Postgres needs to execute the query. The rest of your reported 224.9 seconds are spent somewhere else, probably network transfer and processing in client.

Now, if you could retrieve fewer rows or less data per row ...

Mysql – Why is the dev server faster than a much more powerful RDS

Looks like my initial theory was correct.

A standard RDS is capped at 3 IOPS per GB of storage, hence why my writes were hovering around 332 on a 100 GB instance. Not sure how they went over 300; maybe Amazon is not strict about it or maybe MySQL Workbench just had latency and calculated it slightly off.

Anyways we launched a new RDS with provisioned IOPS set to 1,000. Now the queries are running 3x faster than they were before.

It does seem strange to me that the limit is based on the storage size. Those two aren't necessarily directly related in my opinion.