PostgreSQL performance with (col = value or col is NULL)

Tags: index-tuning, null, performance, postgresql, postgresql-performance

This question deals with PostgreSQL 9.5 query performance.

The table is:

CREATE TABLE big_table
(
   id integer NOT NULL,
   flag bigint NOT NULL,
   time timestamp with time zone NOT NULL,
   val int,
   primary key (id)
);

Consider the query:

SELECT * 
FROM big_table 
WHERE time > '0666-06-06 00:00:00+00' AND 
      flag & 2 = 2 AND 
      (val = 5 OR val IS NULL)
ORDER BY time DESC;

The goal is to make this query perform as well as possible without using a union. I achieved the best performance with the index:

CREATE INDEX ind 
ON big_table USING btree (time DESC, val)
WHERE (flag & 2::smallint = 2::smallint);

which makes the query use the index twice and combine the results with a BitmapOr. This is still much worse than if I could write val = ANY('{NULL,5}'), which does not work (= never matches NULL).
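
A quick check of why that predicate cannot work (a generic demonstration, not tied to this table): comparing NULL with = yields NULL, which WHERE treats as false, so a NULL array element never matches any row.

SELECT NULL::int = ANY('{NULL,5}'::int[]);  -- NULL, not true
SELECT 5 = ANY('{NULL,5}'::int[]);          -- true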

This makes me consider storing some special value such as -1 instead of NULL, which would allow me to write val = ANY('{-1,5}') and run a single index scan with better performance.
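
A minimal sketch of such a migration, assuming -1 can never occur as a real val (the surrogate has to be chosen accordingly):

-- one-time rewrite of the existing NULLs (assumes -1 is free to use)
UPDATE big_table SET val = -1 WHERE val IS NULL;
ALTER TABLE big_table ALTER COLUMN val SET NOT NULL;

-- the OR then collapses into a single array predicate:
-- ... AND val = ANY('{-1,5}')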

So is using some special value instead of NULL better in such situations? Or does PostgreSQL have some optimization for the situation where you want an "OR" between a non-NULL value and NULL?

I read this great article, but didn't see a solution there for what I just asked that avoids a union: https://www.cybertec-postgresql.com/en/avoid-or-for-better-performance/

I also searched for such a question on Stack Overflow and here, but the related questions either don't talk about performance or don't combine a non-NULL value with NULL. This similar question (and the answer it received) is about SQL Server, not PostgreSQL: Best way to write SQL Query that checks a column for non-NULL value or NULL

I also considered using coalesce to get rid of NULL, but have read that it can make performance much worse.

The table has more than 4 million records. Of those, only 56 have NULL in the val column.

UPDATE:
Original explain:

"Limit  (cost=1112.32..1112.32 rows=2 width=519)"
"  ->  Sort  (cost=1112.32..1112.32 rows=2 width=519)"
"        Sort Key: time DESC"
"        ->  Bitmap Heap Scan on big_table  (cost=1104.29..1112.31 rows=2 width=519)"
"              Recheck Cond: (((time > '0666-06-06 00:00:00+00'::timestamp with time zone) AND (val = 3) AND ((flag & '2'::smallint) = '2'::smallint)) OR ((time > '0666-06-06 00:00:00+00'::timestamp with time zone) AND (val IS NULL) AND ((flag & '2'::smallint) = '2'::smallint)))"
"              ->  BitmapOr  (cost=1104.29..1104.29 rows=2 width=0)"
"                    ->  Bitmap Index Scan on ind  (cost=0.00..552.14 rows=1 width=0)"
"                          Index Cond: ((time > '0666-06-06 00:00:00+00'::timestamp with time zone) AND (val = 3))"
"                    ->  Bitmap Index Scan on ind  (cost=0.00..552.14 rows=1 width=0)"
"                          Index Cond: ((time > '0666-06-06 00:00:00+00'::timestamp with time zone) AND (val IS NULL))"

Explain after changing (val = 3 or val IS NULL) to coalesce(val,-1) = ANY('{3,-1}') and creating the index:

CREATE INDEX ind2
ON big_table USING btree (coalesce(val,-1), time DESC)
WHERE (flag & 2::smallint = 2::smallint);

is:

"Limit  (cost=863.45..863.50 rows=20 width=519)"
"  ->  Sort  (cost=863.45..863.99 rows=216 width=519)"
"        Sort Key: time DESC"
"        ->  Bitmap Heap Scan on big_table  (cost=11.08..857.71 rows=216 width=519)"
"              Recheck Cond: ((COALESCE(val, '-1'::integer) = ANY ('{3,-1}'::integer[])) AND (time > '0666-06-06 00:00:00+00'::timestamp with time zone) AND ((flag & '2'::smallint) = '2'::smallint))"
"              ->  Bitmap Index Scan on ind2  (cost=0.00..11.03 rows=216 width=0)"
"                    Index Cond: ((COALESCE(val, '-1'::integer) = ANY ('{3,-1}'::integer[])) AND (time > '0666-06-06 00:00:00+00'::timestamp with time zone))"

Running the original query, but with the order of the time and val columns in the original index reversed, gives the following much better explain plan:

"Limit  (cost=16.92..16.92 rows=2 width=519)"
"  ->  Sort  (cost=16.92..16.92 rows=2 width=519)"
"        Sort Key: time DESC"
"        ->  Bitmap Heap Scan on big_table  (cost=8.89..16.91 rows=2 width=519)"
"              Recheck Cond: (((val = 3) AND (time > '0666-06-06 00:00:00+00'::timestamp with time zone) AND ((flag & '2'::smallint) = '2'::smallint)) OR ((val IS NULL) AND (time > '0666-06-06 00:00:00+00'::timestamp with time zone) AND ((flag & '2'::smallint) = '2'::smallint)))"
"              ->  BitmapOr  (cost=8.89..8.89 rows=2 width=0)"
"                    ->  Bitmap Index Scan on ind  (cost=0.00..4.44 rows=1 width=0)"
"                          Index Cond: ((val = 3) AND (time > '0666-06-06 00:00:00+00'::timestamp with time zone))"
"                    ->  Bitmap Index Scan on ind  (cost=0.00..4.44 rows=1 width=0)"
"                          Index Cond: ((val IS NULL) AND (time > '0666-06-06 00:00:00+00'::timestamp with time zone))"

And now the coalesce variant actually slows the query down (even with the column order flipped in the coalesce index)!

Best Answer

Replacing NULL with a surrogate value may or may not help a little. NULL handling is a bit more expensive in indexes; on the other hand, NULL storage is typically smaller, which also affects performance.

I expect a much greater impact from this: flip the order of index expressions:

CREATE INDEX ON big_table (val, time DESC)
WHERE flag & 2::smallint = 2::smallint;

Rule of thumb: index for equality first, then for ranges.

To your consolation: val = ANY('{-1,5}') burns down to shorthand for (val = -1 OR val = 5), which is hardly better than (val IS NULL OR val = 5). (The more important factor is the number of rows for NULL vs. -1, which is the same in your case if you just replace NULL with -1.)

Also consider upgrading to a current version of Postgres. 9.5 is getting old, and there have been various performance improvements for big tables since.

If you only needed to return a few columns, an index-only scan would be a possible optimization; but you need to return most of your 21 columns, according to the comments.

To save 8 bytes per row needlessly lost to alignment padding, reorder the columns of your demo table like this (the 4-byte id directly before the 8-byte-aligned flag forces 4 bytes of padding, and 4 more are lost padding the end of the row):

CREATE TABLE big_table (
   flag bigint NOT NULL,
   time timestamp with time zone NOT NULL,
   id integer PRIMARY KEY,
   val int
);
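
One way to compare the two layouts without building both tables is pg_column_size() on ad-hoc row values. The absolute numbers include row-header overhead, but the difference between them reflects the padding saved (a sketch with placeholder values):

SELECT pg_column_size(ROW(1::int, 2::bigint, now(), 3::int)) AS original_order,
       pg_column_size(ROW(2::bigint, now(), 1::int, 3::int)) AS reordered;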

Smaller is faster overall. Now, that's obviously just a demo table, but the same principles apply to your actual table.

Sort

For a single val, Postgres can return pre-sorted data from the index directly. But for more than one value it has to merge equally many sorted subsets (one sorted set for val IS NULL, another for val = 5 in your example), so another sort step on top of the index access is inevitable. Pre-sorted sets from the index can still make it cheaper, and you need sorted index tuples in any case. The actual query plan also depends on the chosen index access method: it's trivial to return pre-sorted data from an index scan (or index-only scan), not so much from a bitmap index scan.
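
You can see the single-value case in action. Assuming the (val, time DESC) partial index from above and a planner that chooses a plain index scan, the following should show no separate Sort node, since the index already delivers rows in time DESC order within val = 5:

EXPLAIN
SELECT *
FROM big_table
WHERE time > '0666-06-06 00:00:00+00' AND
      flag & 2::smallint = 2::smallint AND
      val = 5
ORDER BY time DESC;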

Special case: very few NULL values, used all the time

Since you mentioned only a handful of NULL values among millions of rows, I would add a tailored index for that special case:

CREATE INDEX ON big_table (time DESC)
WHERE flag & 2::smallint = 2::smallint
AND   val IS NULL;

Maybe even append all other columns of interest to this very small special index to get an index-only scan (depends on preconditions), even more so in Postgres 11 or later with a true covering index using the INCLUDE clause. Results from this and the other index are merged in a BitmapOr node, just like you see for multiple subsets of the same index now. Postgres has precise estimates for the special case, and it becomes completely irrelevant whether the special case is NULL or -1 or whatever. (Not that it mattered all that much to begin with.)
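
A sketch of such a covering variant for Postgres 11 or later; the INCLUDE list here assumes id is all the query needs besides time, which is not true for your 21-column SELECT *, so treat it as illustration only:

-- covering partial index (INCLUDE requires Postgres 11+)
CREATE INDEX ON big_table (time DESC) INCLUDE (id)
WHERE flag & 2::smallint = 2::smallint
AND   val IS NULL;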