Most efficient way to UPDATE a billion rows in AWS Aurora PostgreSQL

aws-aurora postgresql postgresql-performance

I'm using an AWS-managed Aurora PostgreSQL v15 instance as a catalog for a large number of S3 objects. The level1_dataset table has about 2 billion rows, and its schema includes a metadata JSONB column. An old software bug caused the JSON value null to be written to the metadata column (instead of leaving it empty) when no metadata was supposed to be written. About a billion rows contain this null value, and I want to clean them up with:

UPDATE public.level1_dataset
SET "metadata" = NULL
WHERE "metadata"::text = 'null';

The database is hosted on a db.r6g.2xlarge instance with 8 vCPUs and 64 GB of memory. With this setup, leaving all tuning at the defaults, I'm getting about 42 seconds per million rows. Temporarily scaling up CPU and memory for this cleanup task is an option.

What is the most efficient way to proceed?

UPDATE: One of the approaches @laurenz-albe suggests is to do it in batches. This is how I did it, because in my case "id" is a UUID, not an integer. The extra SELECT costs about a 10% penalty in my use case:

UPDATE public.level1_dataset
SET "metadata" = NULL
WHERE "id" IN (SELECT "id"
    FROM public.level1_dataset
    WHERE "metadata"::text = 'null'
    LIMIT 10000000);
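To avoid running each batch by hand, the loop can be wrapped in a procedure that commits between batches. This is only a sketch (the procedure name and batch size are mine, not from the answer), and since VACUUM cannot run inside a procedure, you still have to rely on autovacuum or run VACUUM separately between calls:

CREATE PROCEDURE clean_metadata(batch_size bigint DEFAULT 1000000)
LANGUAGE plpgsql
AS $$
DECLARE
    n bigint;
BEGIN
    LOOP
        UPDATE public.level1_dataset
        SET "metadata" = NULL
        WHERE "id" IN (SELECT "id"
                       FROM public.level1_dataset
                       WHERE "metadata"::text = 'null'
                       LIMIT batch_size);
        GET DIAGNOSTICS n = ROW_COUNT;
        -- End the transaction so the dead tuples become vacuumable.
        COMMIT;
        EXIT WHEN n = 0;
    END LOOP;
END;
$$;

CALL clean_metadata();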

Best Answer

The fastest way is probably

CREATE TABLE xy AS
SELECT NULLIF(metadata, 'null') AS metadata, ...
FROM level1_dataset;

DROP TABLE level1_dataset;

ALTER TABLE xy RENAME TO level1_dataset;

But that requires downtime.
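If you go this route, keep in mind that CREATE TABLE ... AS copies only the data: indexes, constraints, defaults, and privileges have to be recreated on the new table. A sketch with placeholder names (substitute whatever the real table defines):

-- Placeholder names; recreate whatever the original table defined.
ALTER TABLE xy ADD PRIMARY KEY (id);
CREATE INDEX level1_dataset_metadata_idx ON xy USING gin (metadata);
-- ...plus any foreign keys, column defaults, and GRANTs.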

Other than that, update in batches and VACUUM in between:

UPDATE public.level1_dataset
SET "metadata" = NULL
WHERE "metadata"::text = 'null'
AND id BETWEEN 1 AND 10000000;

VACUUM public.level1_dataset;

UPDATE public.level1_dataset
SET "metadata" = NULL
WHERE "metadata"::text = 'null'
AND id BETWEEN 10000001 AND 20000000;

VACUUM public.level1_dataset;

...
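With an integer id, the repetitive statements don't have to be written by hand. In psql they can be generated and executed with \gexec; a sketch assuming ids run from 1 to roughly 2 billion (each result row yields an UPDATE followed by a VACUUM, executed one statement at a time):

-- A sketch, psql only: generate one UPDATE per 10-million-id window
-- plus a VACUUM after each, then execute them all with \gexec.
SELECT format('UPDATE public.level1_dataset
               SET "metadata" = NULL
               WHERE "metadata"::text = ''null''
               AND id BETWEEN %s AND %s', lo, lo + 9999999),
       'VACUUM public.level1_dataset'
FROM generate_series(1, 2000000000, 10000000) AS lo
\gexec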