PostgreSQL – When Is COUNT(x OR NULL) Needed in Gaps and Islands?

countgaps-and-islandsnullpostgresqlwindow functions

In this answer, Erwin Brandstetter says:

count(step OR NULL) OVER (ORDER BY date) is the shortest syntax that also works in Postgres 9.3 or older. count() only counts non-null values. In modern Postgres, the cleaner, equivalent syntax would be:
count(step) FILTER (WHERE step) OVER (ORDER BY date)

I'm unsure of why count(step OR NULL) is preferred though. In my query, I do the following. I renamed my variables to match his while maintaining the syntax.

CASE WHEN lag(id_type) OVER (ORDER BY date) <> id_type THEN 1 END AS step

We're counting the values returned by that. Note, that the case can only return 1, or null.

if the two ARE NOT equal, 1 is returned.
if they are equal, it returns null which isn't counted.

Erwin's answer has:

This assumes involved columns are NOT NULL. Else you need to do more.

So I'm even more confused. What is the point of adding count(step OR NULL) what is this protecting our query against?

Could anyone break this down and perhaps show two examples with data wherein only one of them – the one with – count(x OR NULL) works?

Best Answer

It's all about NULL handling. The expression lag(id_type) OVER (ORDER BY date) <> id_type evaluates to a boolean value, which can be TRUE, FALSE or NULL. If either the left or the right operand is NULL, we get NULL. There are various ways NULL can be introduced on either side:

No previous (or next or similar in related cases) row.
Column can be NULL in underlying table.
An OUTER JOIN introduces NULL (no matching row found).
An expression is involved that can evaluate to NULL.

There are various ways to fold three-valued logic to just two cases. A CASE statement, an OR expression etc. Related answer with details:

Select longest continuous sequence

Here is a comparison of the involved techniques, plus (updated) benchmark. Basically, they all perform the same, it's a cheap operation. The aggregate FILTER clause, introduced Postgres 9.4, seems a bit faster in the updated benchmark:

There is no case wherein only one of them works, the techniques are interchangeable. What to count or sum exactly depends on details of setup and requirements. In this case we only count TRUE - as long as the underlying table or query yields all non-null values (the corner case of the first row is open for debate).

And this case only requires a simple comparison. Related cases need to look at the previous or next value more closely. Then it may come in handy to provide a default value as 3rd parameter to lag() or lead().

Related Solutions

MySQL IS NULL / IS NOT NULL Misbehaving

Do you have some zero dates? Datetime values of 0000-00-00 00:00:00 are considered by MySQL to simultaneously satisfy is null and is not null:

steve@steve@localhost > create temporary table _tmp (a datetime not null);
Query OK, 0 rows affected (0.02 sec)

steve@steve@localhost > insert into _tmp values ('');
Query OK, 1 row affected, 1 warning (0.00 sec)

Warning (Code 1264): Out of range value for column 'a' at row 1
steve@steve@localhost > select a from _tmp where a is null;
+---------------------+
| a                   |
+---------------------+
| 0000-00-00 00:00:00 |
+---------------------+
1 row in set (0.00 sec)

steve@steve@localhost > select a from _tmp where a is not null;
+---------------------+
| a                   |
+---------------------+
| 0000-00-00 00:00:00 |
+---------------------+
1 row in set (0.00 sec)

See: http://bugs.mysql.com/bug.php?id=940

This is classified as "not a bug". They suggest a workaround: use strict mode, which will convert the insertion warning into an error.

Having said all that, this alone can't explain the wild variation in the results you're getting (the sum of the is null and is not null counts should exceed the unrestricted count)...

Display non-null and null count using case statement

This may not directly help you, since I only know SQL Server, but in any case this is how I would approach this:

Sample Data:

CREATE TABLE #Example
(
    Eid integer NULL
);

INSERT #Example
    (Eid)
VALUES
    (1),
    (2),
    (3),
    (4),
    (5),
    (NULL),
    (NULL),
    (NULL),
    (NULL),
    (NULL);

My idea revolves around the idea that COUNT(column) does not count NULLs, whereas COUNT(*) does:

SELECT 
    [Number] = COUNT(E.Eid), 
    [Null] = COUNT(*) - COUNT(E.Eid)
FROM #Example AS E;

Output 1:

+--------+------+
| Number | Null |
+--------+------+
|      5 |    5 |
+--------+------+

If the specific arrangement of rows and columns is important, I would UNPIVOT:

WITH Previous AS
(
    -- Same query as above
    SELECT 
        [Number] = COUNT(E.Eid), 
        [Null] = COUNT(*) - COUNT(E.Eid)
    FROM #Example AS E
)
SELECT
    U.Eid,
    U.Counts
FROM Previous
UNPIVOT
(
    Counts FOR Eid IN ([Number], [Null])
) AS U;

Output 2:

+--------+--------+
|  Eid   | Counts |
+--------+--------+
| Number |      5 |
| Null   |      5 |
+--------+--------+

Using `CASE`

SELECT 
    C.Name, 
    Counts = COUNT(*) 
FROM 
(
    SELECT 
        Name =
            CASE 
                WHEN E.Eid IS NULL 
                THEN 'Null' 
                ELSE 'Number' 
            END 
    FROM #Example AS E
) AS C
GROUP BY
    C.Name;

Output 3:

+--------+--------+
|  Name  | Counts |
+--------+--------+
| Null   |      5 |
| Number |      5 |
+--------+--------+

Edit:

This is the query form Babu ended up using:

select 
    count(*) as cnt, 
    case 
        when smpth is null 
        then 'NULL' 
        else 'NUMBER' 
    end numbertest 
from employees 
group by 
    case 
        when smpth is null 
        then 'NULL' 
        else 'NUMBER' 
    end numbertest