You did not specify PostgreSQL version. On 9.0.5 I have observed the same behavior (not sure if it's a bug, see comments below).
This is easy to work around: first test for definedness to catch NULL and, if that passes, test for the empty string.
# In a PL/Perl trigger function, $_TD->{ new } and $_TD->{ old }
# are hash references holding the new and old row values.
if ( not defined $_TD->{ new }{ $column } ) {
    elog( NOTICE, "New value in column $column is not defined" );
}
elsif ( $_TD->{ new }{ $column } eq '' ) {
    elog( NOTICE, "New value in column $column is empty" );
}

if ( not defined $_TD->{ old }{ $column } ) {
    elog( NOTICE, "Old value in column $column is not defined" );
}
elsif ( $_TD->{ old }{ $column } eq '' ) {
    elog( NOTICE, "Old value in column $column is empty" );
}
Since the columns e, k, and n can be NULL, I assume "100% empty" means NULL.
NULL storage is cheap. Each NULL "costs" one bit in the null bitmap for storage and otherwise hardly affects performance. Effective storage requirements depend on whether the null bitmap for each row already exists and has room for 3 more bits.
In the current implementation, tables with up to 8 user columns utilize a spare byte after the tuple header for the null bitmap. For any more, another MAXALIGN (typically 8) bytes are allocated, providing for another 64 columns (72 in total). And so on. So the null bitmap effectively costs 8 bytes per row for most tables, including yours, which has 18 columns.
The null bitmap is allocated completely or not at all per row, and only if there is at least one actual NULL value. If all columns are defined NOT NULL, there can never be one.
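As a back-of-the-envelope sketch (this is not PostgreSQL source code, just a model of the rules described above, assuming MAXALIGN = 8), the per-row cost of the null bitmap can be computed like this:

```python
def null_bitmap_bytes(user_columns: int, maxalign: int = 8) -> int:
    """Extra bytes per row for the null bitmap: a spare byte after
    the tuple header covers up to 8 columns; beyond that, MAXALIGN-byte
    chunks are allocated, each covering maxalign * 8 more columns."""
    if user_columns <= 8:
        return 0  # fits in the spare byte, no extra allocation
    extra_bits = user_columns - 8
    bits_per_chunk = maxalign * 8      # 64 more columns per 8-byte chunk
    chunks = -(-extra_bits // bits_per_chunk)  # ceiling division
    return chunks * maxalign

# An 18-column table like yours pays 8 bytes per row
# that contains at least one NULL value:
print(null_bitmap_bytes(8))   # 0
print(null_bitmap_bytes(18))  # 8
print(null_bitmap_bytes(72))  # 8
print(null_bitmap_bytes(73))  # 16
```

Rows with no NULL at all skip the bitmap entirely, which is why the cost only applies where at least one column is NULL.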
In the absence of other, more important considerations, move columns that are always (or mostly) NULL to the end of the row. Helps performance a tiny bit.
Answer to question 1.
The effective storage requirement is:
- 0 bytes per row where at least one other column is also NULL (the null bitmap is allocated anyway).
- 8 bytes (typically) per row where no other column is NULL, to allocate the null bitmap.
Answer to question 2.
Dropping the columns would not buy you much. The footprint in the page cache would be the same or slightly smaller, depending on answer 1. Handling of queries, backups, source code and such would be simplified.
Further reading:
What you did not ask for
After a round of "column tetris" I suggest this table layout - again barring other, more important considerations:
id serial NOT NULL
, a integer
, g timestamp
, i timestamp
, b integer
, c integer
, h integer
, j integer
, l integer
, d smallint
, f smallint
, m smallint
, o varchar(36)
, p varchar(100)
, q varchar(100)
, k varchar(32) -- always NULL
, n smallint -- always NULL
, e timestamp -- always NULL
Saves at least 16 bytes per row compared to your original layout, probably 20+ depending on your varchar columns. Instructions for "column tetris" here:
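The savings come largely from alignment padding: each fixed-width type must start at an offset that is a multiple of its alignment requirement (8 for timestamp, 4 for integer, 2 for smallint), so ordering columns by descending alignment avoids wasted pad bytes. A simplified sketch of that arithmetic (it ignores the tuple header and variable-width columns; the sizes and alignments are the usual ones, not read from your database):

```python
def padded_row_size(columns):
    """columns: list of (size, align) pairs for fixed-width columns.
    Returns total bytes including alignment padding between columns."""
    offset = 0
    for size, align in columns:
        offset += -offset % align  # pad up to the next aligned offset
        offset += size
    return offset

TIMESTAMP = (8, 8)
INTEGER = (4, 4)
SMALLINT = (2, 2)

# Interleaving small and large columns wastes pad bytes ...
interleaved = [SMALLINT, TIMESTAMP, INTEGER, TIMESTAMP, SMALLINT, INTEGER]
# ... while descending alignment packs the same columns tightly:
sorted_desc = [TIMESTAMP, TIMESTAMP, INTEGER, INTEGER, SMALLINT, SMALLINT]

print(padded_row_size(interleaved))  # 40
print(padded_row_size(sorted_desc))  # 28
```

The same six columns cost 12 bytes less per row purely from ordering, which is the whole point of "column tetris".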
How to reorder columns
The simplest method is to create a new table and copy data into it. Then you get a pristine new table without dead columns (and dead rows). You could sort (cluster) the physical order of rows in the process, too.
Changes in place are possible too, but there is no convenient tool (that I would know of) to reorder the sequence of columns in general. Dependent objects like views, functions, FK constraints, indexes etc. are the limiting factor here: they can block you from deleting a column, so you have to drop and recreate those dependent objects.
And of course, concurrent access to the table conflicts with DDL commands changing it. You need exclusive table locks in this case.
It's simple and cheap to delete completely empty columns and add them at the end of the row. Switching the sequence of populated columns in the middle of the table is not as simple or cheap. I would create a new table for that.
This related answer has recipes and links to more:
In addition, you can get an approximate answer by querying the statistics view pg_stats: a null_frac of zero for a column indicates that no NULL values were found, but since the statistics are based on a sample, this should be double-checked against the 'real' data.