Most efficient way to UPDATE a billion rows in AWS Aurora PostgreSQL

aws-aurora postgresql postgresql-performance

I'm using an AWS-managed Aurora PostgreSQL v15 instance as a catalog for a large number of S3 objects. The level1_dataset table has about 2 billion rows, and its schema includes a metadata JSONB column. An old software bug caused the JSON value null to be written to the metadata column (instead of leaving it empty) when no metadata was supposed to be written. About a billion rows contain this null value, and I want to clean them up with:

UPDATE public.level1_dataset
SET "metadata" = NULL
WHERE "metadata"::text = 'null';

The database is hosted on a db.r6g.2xlarge instance with 8 vCPUs and 64 GB of memory. With this setup, leaving all tuning at the defaults, I'm getting about 42 seconds per million rows. Temporarily scaling up CPU and memory for this cleanup task is an option.

What is the most efficient way to proceed?

UPDATE: One of the approaches @laurenz-albe suggests is to do it in batches. This is how I did it, because in my case "id" is a UUID, not an integer. The extra SELECT costs about a 10% penalty in my use case:

UPDATE public.level1_dataset
SET "metadata" = NULL
WHERE "id" IN (SELECT "id"
    FROM public.level1_dataset
    WHERE "metadata"::text = 'null'
    LIMIT 10000000);
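To avoid running each batch by hand, the loop can be wrapped in a procedure that commits between batches. This is only a sketch (the procedure name and batch size are mine, not from the answer), and since VACUUM cannot run inside a procedure, you still have to rely on autovacuum or run VACUUM separately between calls:

CREATE PROCEDURE clean_metadata(batch_size bigint DEFAULT 1000000)
LANGUAGE plpgsql
AS $$
DECLARE
    n bigint;
BEGIN
    LOOP
        UPDATE public.level1_dataset
        SET "metadata" = NULL
        WHERE "id" IN (SELECT "id"
                       FROM public.level1_dataset
                       WHERE "metadata"::text = 'null'
                       LIMIT batch_size);
        GET DIAGNOSTICS n = ROW_COUNT;
        -- End the transaction so the dead tuples become vacuumable.
        COMMIT;
        EXIT WHEN n = 0;
    END LOOP;
END;
$$;

CALL clean_metadata();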

Best Answer

The fastest way is probably

CREATE TABLE xy AS
SELECT NULLIF(metadata, 'null') AS metadata, ...
FROM level1_dataset;

DROP TABLE level1_dataset;

ALTER TABLE xy RENAME TO level1_dataset;

But that requires downtime.
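If you go this route, keep in mind that CREATE TABLE ... AS copies only the data: indexes, constraints, defaults, and privileges have to be recreated on the new table. A sketch with placeholder names (substitute whatever the real table defines):

-- Placeholder names; recreate whatever the original table defined.
ALTER TABLE xy ADD PRIMARY KEY (id);
CREATE INDEX level1_dataset_metadata_idx ON xy USING gin (metadata);
-- ...plus any foreign keys, column defaults, and GRANTs.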

Other than that, update in batches and VACUUM in between:

UPDATE public.level1_dataset
SET "metadata" = NULL
WHERE "metadata"::text = 'null'
AND id BETWEEN 1 AND 10000000;

VACUUM public.level1_dataset;

UPDATE public.level1_dataset
SET "metadata" = NULL
WHERE "metadata"::text = 'null'
AND id BETWEEN 10000001 AND 20000000;

VACUUM public.level1_dataset;

...
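With an integer id, the repetitive statements don't have to be written by hand. In psql they can be generated and executed with \gexec; a sketch assuming ids run from 1 to roughly 2 billion (each result row yields an UPDATE followed by a VACUUM, executed one statement at a time):

-- A sketch, psql only: generate one UPDATE per 10-million-id window
-- plus a VACUUM after each, then execute them all with \gexec.
SELECT format('UPDATE public.level1_dataset
               SET "metadata" = NULL
               WHERE "metadata"::text = ''null''
               AND id BETWEEN %s AND %s', lo, lo + 9999999),
       'VACUUM public.level1_dataset'
FROM generate_series(1, 2000000000, 10000000) AS lo
\gexec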