You are half missing the connection between the two tables. The ...relid columns must match, too:
SELECT attname, c.*
FROM pg_attribute a
JOIN pg_constraint c
ON attrelid = conrelid -- this was missing
AND attnum = ANY (conkey)
WHERE attrelid = 'test_table'::regclass;
Basic answers
Since you select a couple of big columns, an index-only scan is probably not a viable option.
This code works (as long as there are no NULL values in the data!). Since the column isn't defined NOT NULL, add NULLS LAST to the sort order to make it work in any case, even with NULL values. Ideally, use the clause in the corresponding index as well - a sketch follows the query:
SELECT <some big columns>
FROM my_table_
ORDER BY when_ DESC NULLS LAST
LIMIT 1;
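The corresponding index could look like this - a sketch, where the index name is just an example:
CREATE INDEX my_table_when_nulls_last_idx ON my_table_ (when_ DESC NULLS LAST);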
Without any index on the when_ column, does this statement require a full scan of all rows?
Yes. Without an index, there is no other option left. (Well, there is also table partitioning, where an index on the key column(s) is not strictly required and partition pruning can help instead. But you would typically have an index on the key columns there, too.)
With an index on the when_ column, should I change this SQL to use some other query approach/strategy?
Basically, this is the perfect query. There are options in combination with advanced indexing:
Advanced technique
Assuming a NOT NULL column. Else, add NULLS LAST to the index and queries as suggested above.
You have a constant influx of rows with a later when_. Assuming the latest when_ constantly increases and never (or rarely) decreases (which would happen if the latest rows were deleted or updated), you can use a very small partial index.
Basic implementation:
Run your query once to retrieve the latest when_, subtract a safe margin (to guard against losing the latest rows) and create an IMMUTABLE function based on it. Basically a "fake global constant":
CREATE OR REPLACE FUNCTION f_when_cutoff()
RETURNS timestamptz LANGUAGE sql COST 1 IMMUTABLE PARALLEL SAFE AS
$$SELECT timestamptz '2015-07-25 01:00+02'$$;
PARALLEL SAFE only in Postgres 9.6 or later.
Create a partial index excluding older rows:
CREATE INDEX my_table_when_idx ON my_table_ (when_ DESC)
WHERE when_ > f_when_cutoff();
With millions of rows, the difference in size can be dramatic. And this only makes sense with a much smaller index. Just half the size or something would not cut it. Index access itself is not slowed much by a bigger index. It's mostly the sheer size of the index, which needs to be read and cached. (And possibly avoiding additional index writes, but hardly in your case.)
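To see what you gained, compare sizes - a quick check, using the index name from above:
SELECT pg_size_pretty(pg_relation_size('my_table_when_idx'));  -- size of the partial index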
Use the function in all related queries. Include the same WHERE condition (even if logically redundant) to convince the query planner that the index is applicable. For the simple query:
SELECT <some big columns>
FROM my_table_
WHERE when_ > f_when_cutoff()
ORDER BY when_ DESC
LIMIT 1;
The size of the index grows with new (later) entries. Recreate the function with a later timestamp and REINDEX from time to time, during periods with little or no concurrent access. Only reindex after a relevant number of rows has been added; a couple of thousand entries won't matter much. We are doing this to cut off millions.
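The maintenance step might look like this - a sketch, with a hypothetical new cutoff timestamp:
CREATE OR REPLACE FUNCTION f_when_cutoff()
RETURNS timestamptz LANGUAGE sql COST 1 IMMUTABLE PARALLEL SAFE AS
$$SELECT timestamptz '2016-07-25 01:00+02'$$;  -- later cutoff, still with a safe margin

REINDEX INDEX my_table_when_idx;  -- rebuilds the index, excluding rows older than the new cutoff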
The beauty of it: queries don't change.
Best Answer
It very much depends on details of your setup and requirements.
Note that since Postgres 11, only adding a column with a volatile DEFAULT still triggers a table rewrite. Unfortunately, this is your case.
If you have sufficient free space on disk - at least 110 % of pg_size_pretty(pg_total_relation_size(tbl)) - and can afford a SHARE lock for some time and an exclusive lock for a very short time, then create a new table including the uuid column using CREATE TABLE AS. The code below uses a function from the additional uuid-ossp module.
Lock the table against concurrent changes in SHARE mode (still allowing concurrent reads). Attempts to write to the table will wait and eventually fail. See below.
Copy the whole table while populating the new column on the fly - possibly ordering rows favorably while you are at it.
If you are going to reorder rows, be sure to set work_mem high enough to do the sort in RAM, or as high as you can afford (just for your session, not globally).
Then add constraints, foreign keys, indices, triggers etc. to the new table. When updating large portions of a table, it is much faster to create indices from scratch than to add rows iteratively. Related advice in the manual.
When the new table is ready, drop the old and rename the new to make it a drop-in replacement. Only this last step acquires an exclusive lock on the old table for the rest of the transaction - which should be very short now.
It also requires that you drop any objects depending on the table type (views, functions using the table type in their signature, ...) and recreate them afterwards.
Do it all in one transaction to avoid incomplete states.
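Putting the steps together - a minimal sketch, assuming the table is named tbl, the new column uuid, and uuid_generate_v4() from uuid-ossp as the generator; adapt names, types and ordering to your schema:
BEGIN;
LOCK TABLE tbl IN SHARE MODE;                    -- block concurrent writes, allow reads
SET LOCAL work_mem = '1GB';                      -- only relevant if you reorder rows; adjust
CREATE TABLE tbl_new AS
SELECT *, uuid_generate_v4() AS uuid             -- populate the new column on the fly
FROM   tbl;                                      -- optionally add ORDER BY ... here
ALTER TABLE tbl_new ALTER COLUMN uuid SET NOT NULL;
-- add constraints, foreign keys, indices, triggers etc. here
DROP TABLE tbl;                                  -- takes the short exclusive lock
ALTER TABLE tbl_new RENAME TO tbl;
COMMIT;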
This should be fastest. Any other method of updating in place has to rewrite the whole table as well, just in a more expensive fashion. You would only go that route if you don't have enough free space on disk or cannot afford to lock the whole table or generate errors for concurrent write attempts.
What happens to concurrent writes?
Other transactions (in other sessions) trying to INSERT / UPDATE / DELETE in the same table after your transaction has taken the SHARE lock will wait until the lock is released or a timeout kicks in, whichever comes first. They will fail either way, since the table they were trying to write to has been deleted from under them.
The new table has a new table OID, but concurrent transactions have already resolved the table name to the OID of the previous table. When the lock is finally released, they try to lock the table themselves before writing to it and find that it's gone. Postgres will answer:
ERROR: could not open relation with OID 123456
Where 123456 is the OID of the old table. You need to catch that exception and retry the queries in your app code to avoid it.
If you cannot afford for that to happen, you have to keep your original table.
Keeping the existing table, alternative 1
Update in place (possibly running the update on small segments at a time) before you add the NOT NULL constraint. Adding a new column with NULL values and without a NOT NULL constraint is cheap.
Since Postgres 9.2 you can also create a CHECK constraint with NOT VALID:
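A sketch of what that can look like - the table and column names (tbl, uuid) are assumptions:
ALTER TABLE tbl
ADD CONSTRAINT tbl_uuid_not_null CHECK (uuid IS NOT NULL) NOT VALID;  -- existing rows are not checked yet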
That allows you to update rows little by little - in multiple separate transactions. This avoids keeping row locks for too long, and it also allows dead rows to be reused. (You'll have to run VACUUM manually if there is not enough time in between for autovacuum to kick in.) Finally, add the NOT NULL constraint and remove the NOT VALID CHECK constraint:
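For example, with the same assumed names:
ALTER TABLE tbl ALTER COLUMN uuid SET NOT NULL;     -- all rows are filled by now
ALTER TABLE tbl DROP CONSTRAINT tbl_uuid_not_null;  -- the CHECK constraint is now redundant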
Keeping the existing table, alternative 2
Prepare the new state in a temporary table, TRUNCATE the original and refill from the temp table. All in one transaction. You still need to take a SHARE lock before preparing the new table to prevent losing concurrent writes.
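A rough sketch of alternative 2, under the same naming assumptions (tbl, uuid, uuid-ossp):
BEGIN;
LOCK TABLE tbl IN SHARE MODE;                    -- prevent losing concurrent writes
CREATE TEMP TABLE tbl_tmp ON COMMIT DROP AS
SELECT *, uuid_generate_v4() AS uuid             -- new column populated on the fly
FROM   tbl;
TRUNCATE tbl;                                    -- empty the original table
ALTER TABLE tbl ADD COLUMN uuid uuid NOT NULL;   -- cheap on an empty table
INSERT INTO tbl SELECT * FROM tbl_tmp;           -- refill; column order matches
COMMIT;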