Row count incorrectly displayed as 0 in redshift

countredshift

I have a table in redshift with columns user_id and some_property, neither of which are unique.

The following query returns 0:

SELECT COUNT(DISTINCT user_id) FROM table
WHERE user_id NOT IN (SELECT DISTINCT user_id FROM table WHERE some_property = 1)

Suggesting that for every user_id in the table, there is an entry with some_property=1.

However if I run

SELECT COUNT(DISTINCT user_id) FROM table

then it returns a number greater than

SELECT COUNT(DISTINCT user_id) FROM table WHERE some_property = 1

To me this is a logical contradiction. What am I missing here?

Best Answer

Given the following table:

CREATE TABLE 
    t 
AS
SELECT 
    * 
FROM 
(
    VALUES
        (   1,    1),
        (   2,    0),
        (NULL,    1)
) AS x (user_id, some_property) ;

The SELECT COUNT from the original poster returns 0, the second one returns 2, and the last one returns 1.


Let's see why:

If you take out the counts and retrieve the rows, we get:

(1)

SELECT DISTINCT 
    user_id
FROM 
    t
WHERE user_id NOT IN (SELECT DISTINCT user_id FROM t WHERE some_property = 1) ;

|---------|
| user_id |
|---------|

(that is: nothing. The count, in this case, is clearly 0)

(2)

SELECT DISTINCT user_id FROM t ;

|---------|
| user_id |
|---------|
|   NULL  |
|---------|
|    1    |
|---------|
|    2    |
|---------|

(there are 3 rows, but the count is 2, because you don't count NULL)

And last,

(3)

SELECT DISTINCT user_id FROM t where some_property = 1 ;

|---------|
| user_id |
|---------|
|    1    |
|---------|
|   NULL  |
|---------|

(There are 2 rows, but Count is 1, because, again, you don't count NULL).


You have to take into account that, according to SQL standard:

a NOT IN (x, y, z)                is equivalent to   
NOT (a IN (x, y, z))              which in turn, is equivalent to
NOT (a = x OR a = y OR a = z)

If a IS NULL, you get NOT (NULL = x OR NULL = y OR NULL = z) which is NULL, regardless of the values of x, y or z.