PostgreSQL – When Is COUNT(x OR NULL) Needed in Gaps and Islands?

countgaps-and-islandsnullpostgresqlwindow functions

In this answer, Erwin Brandstetter says:

count(step OR NULL) OVER (ORDER BY date) is the shortest syntax that also works in Postgres 9.3 or older. count() only counts non-null values. In modern Postgres, the cleaner, equivalent syntax would be:

count(step) FILTER (WHERE step) OVER (ORDER BY date)

I'm unsure of why count(step OR NULL) is preferred though. In my query, I do the following. I renamed my variables to match his while maintaining the syntax.

CASE WHEN lag(id_type) OVER (ORDER BY date) <> id_type THEN 1 END AS step

We're counting the values returned by that. Note, that the case can only return 1, or null.

  • if the two ARE NOT equal, 1 is returned.
  • if they are equal, it returns null which isn't counted.

Erwin's answer has:

This assumes involved columns are NOT NULL. Else you need to do more.

So I'm even more confused. What is the point of adding count(step OR NULL) what is this protecting our query against?

Could anyone break this down and perhaps show two examples with data wherein only one of them – the one with – count(x OR NULL) works?

Best Answer

It's all about NULL handling. The expression lag(id_type) OVER (ORDER BY date) <> id_type evaluates to a boolean value, which can be TRUE, FALSE or NULL. If either the left or the right operand is NULL, we get NULL. There are various ways NULL can be introduced on either side:

  • No previous (or next or similar in related cases) row.
  • Column can be NULL in underlying table.
  • An OUTER JOIN introduces NULL (no matching row found).
  • An expression is involved that can evaluate to NULL.

There are various ways to fold three-valued logic to just two cases. A CASE statement, an OR expression etc. Related answer with details:

Here is a comparison of the involved techniques, plus (updated) benchmark. Basically, they all perform the same, it's a cheap operation. The aggregate FILTER clause, introduced Postgres 9.4, seems a bit faster in the updated benchmark:

There is no case wherein only one of them works, the techniques are interchangeable. What to count or sum exactly depends on details of setup and requirements. In this case we only count TRUE - as long as the underlying table or query yields all non-null values (the corner case of the first row is open for debate).

And this case only requires a simple comparison. Related cases need to look at the previous or next value more closely. Then it may come in handy to provide a default value as 3rd parameter to lag() or lead().