Your problem is that when you add a new column to temp_person, its child temp_person_two will have this field appended at the end of its column list (4th position), i.e. after the has_default column. See:
db=> \d temp_person
Column | Type | Modifiers
-----------+-------------------+-----------------------------------------------------------------
person_id | integer | not null default nextval('temp_person_person_id_seq'::regclass)
name | character varying |
foo | text |
db=> \d temp_person_two
Column | Type | Modifiers
-------------+----------------------+-----------------------------------------------------------------
person_id | integer | not null default nextval('temp_person_person_id_seq'::regclass)
name | character varying |
has_default | character varying(4) | not null default 'en'::character varying
foo | text |
So, when you execute this:
INSERT INTO temp_person_two VALUES ( NEW.* );
PostgreSQL will actually understand that you want to insert into the first three columns of temp_person_two (as NEW.* expands to three values), generating something similar to this:
INSERT INTO temp_person_two(person_id,name,has_default)
VALUES ( NEW.person_id, NEW.name, NEW.foo );
So temp_person_two.has_default will get the value of NEW.foo, which is NULL in your case.
The solution is to simply expand the column names:
INSERT INTO temp_person_two(person_id,name,foo)
VALUES ( NEW.person_id, NEW.name, NEW.foo );
or, you could also use this:
INSERT INTO temp_person_two(person_id,name,foo)
VALUES ( NEW.* );
But this is fragile, as any change in column positions may break your statements, so I'd recommend the first one.
EDIT:
So the conclusion and the lesson learned here is:
Always explicitly write the names of the columns and the values when issuing an INSERT command; in fact, when issuing any SQL command at all... =D
This will save you a lot of time solving problems like this in the future.
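For completeness, a minimal sketch of the corrected trigger function (the function name and the RETURN NULL convention are assumptions about your setup, not taken from it):

CREATE OR REPLACE FUNCTION temp_person_insert() RETURNS trigger AS $$
BEGIN
    -- explicit column list: survives future column additions on either table
    INSERT INTO temp_person_two (person_id, name, foo)
    VALUES (NEW.person_id, NEW.name, NEW.foo);
    RETURN NULL;  -- assumption: the row should not also be stored in the parent
END;
$$ LANGUAGE plpgsql;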
Replacing NULL with a surrogate value may or may not help a little. NULL handling is a bit more expensive in indexes but, on the other hand, NULL is typically smaller in storage, which also impacts performance.
I expect a much greater impact from this: flip the order of index expressions:
CREATE INDEX ON big_table (val, time DESC)
WHERE flag & 2::smallint = 2::smallint;
Rule of thumb: index for equality first, then for ranges.
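For illustration, a query of this shape is what the flipped index serves well (the concrete value and the LIMIT are assumptions, not from your workload):

SELECT *
FROM   big_table
WHERE  flag & 2::smallint = 2::smallint  -- matches the partial index predicate
AND    val = 5                           -- equality column first in the index ...
ORDER  BY time DESC                      -- ... then the sort / range column
LIMIT  100;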
To your consolation: val = ANY('{-1,5}') burns down to being syntax shorthand for (val = -1 OR val = 5), which is hardly better than (val IS NULL OR val = 5). (The more important factor is the number of rows for NULL vs. -1 - the same in your case if you just replace NULL with -1.)
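Put side by side (a sketch, not taken from your queries):

-- with a surrogate value -1 stored instead of NULL:
SELECT * FROM big_table WHERE val = ANY ('{-1,5}');
-- keeping NULL, the corresponding form:
SELECT * FROM big_table WHERE val IS NULL OR val = 5;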
Also consider upgrading to a current version of Postgres. 9.5 is getting old; there have been various performance improvements for big tables since.
To return only a few columns, an index-only scan would be a possible optimization, but according to the comments you need to return most of your 21 columns.
To save 8 bytes per row needlessly lost to alignment padding, reorder the columns of your demo table like this:
CREATE TABLE big_table (
   flag bigint NOT NULL,
   time timestamp with time zone NOT NULL,
   id   integer PRIMARY KEY,
   val  int
);
Smaller is faster overall. Now, that's obviously just a demo table, but the same principles apply to your actual table.
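If you want to measure the effect, pg_column_size() over a whole row reports the size of the stored row value, e.g. (a quick sketch against the demo table):

SELECT pg_column_size(t) AS row_bytes  -- size of one stored row value
FROM   big_table t
LIMIT  1;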
Sort
For a single val, Postgres can return pre-sorted data from the index directly. But for more than one value it has to merge equally many sorted sub-sets (one sorted set for val IS NULL, another one for val = 5 in your example), so another sort step on top of index access is inevitable. Pre-sorted sets from the index can still make it cheaper - and you need sorted index tuples in any case. The actual query plan also depends on the chosen index access method. It's trivial to return pre-sorted data from an index scan (or index-only scan). Not so much for a bitmap index scan.
Special case: very few NULL values, used all the time
Since you mentioned only a handful of NULL values for millions of rows, I would add a tailored index for that special case:
CREATE INDEX ON big_table (time DESC)
WHERE flag & 2::smallint = 2::smallint
AND val IS NULL;
Maybe even append all other columns of interest to this very tiny special index to get an index-only scan. (Depends on preconditions.) Even more so in Postgres 11 or later with a true covering index using the INCLUDE clause. Results from this and the other index are merged in a BitmapOr node, just like you see for multiple subsets from the same index now. Postgres has precise estimates for the special case, and it becomes completely irrelevant whether the special case is NULL or -1 or whatever. (Not that it mattered all that much to begin with.)
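A sketch of such a covering variant for Postgres 11 or later; which columns go into INCLUDE depends entirely on what your queries return (id here is just a stand-in):

CREATE INDEX ON big_table (time DESC) INCLUDE (id)
WHERE  flag & 2::smallint = 2::smallint
AND    val IS NULL;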
Best Answer
There is no such mechanism in PostgreSQL.
However, you can still avoid the worst effects of such a table change.
The following statement acquires an access exclusive lock on the table for the duration of the statement/transaction:
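For example, with placeholder table and column names:

ALTER TABLE big_table
ADD COLUMN new_col text NOT NULL DEFAULT 'pending';  -- rewrites every row before Postgres 11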
This statement changes the catalog, then rewrites the whole table so that the new column contains the default value in all rows. If the table has many rows and is accessed frequently enough, this can cause some temporary problems.
To avoid that, try to hold the exclusive lock for as short a time as possible:
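Again with the placeholder names, split into two catalog-only steps:

ALTER TABLE big_table ADD COLUMN new_col text;                    -- no rewrite: new column is NULL
ALTER TABLE big_table ALTER COLUMN new_col SET DEFAULT 'pending'; -- applies to future rows only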
As this is basically only one change (actually two changes) to the catalog and no data change happens, it will complete pretty fast. Then, depending on your needs and table usage, you can update the new column to the default in one step or in batches (sketched below), and when finished, set the column to NOT NULL.

Update about a wish coming true: PostgreSQL 11 will have this feature. See https://www.depesz.com/2018/04/04/waiting-for-postgresql-11-fast-alter-table-add-column-with-a-non-null-default/ for more.
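A sketch of that backfill with the same placeholder names (batch boundaries depend on your data):

-- fill existing rows, in one go or in batches:
UPDATE big_table SET new_col = DEFAULT WHERE new_col IS NULL;
-- when every row is filled:
ALTER TABLE big_table ALTER COLUMN new_col SET NOT NULL;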