Row count incorrectly displayed as 0 in redshift

countredshift

I have a table in redshift with columns user_id and some_property, neither of which are unique.

The following query returns 0:

SELECT COUNT(DISTINCT user_id) FROM table
WHERE user_id NOT IN (SELECT DISTINCT user_id FROM table WHERE some_property = 1)

Suggesting that for every user_id in the table, there is an entry with some_property=1.

However if I run

SELECT COUNT(DISTINCT user_id) FROM table

then it returns a number greater than

SELECT COUNT(DISTINCT user_id) FROM table WHERE some_property = 1

To me this is a logical contradiction. What am I missing here?

Best Answer

Given the following table:

CREATE TABLE 
    t 
AS
SELECT 
    * 
FROM 
(
    VALUES
        (   1,    1),
        (   2,    0),
        (NULL,    1)
) AS x (user_id, some_property) ;

The SELECT COUNT from the original poster returns 0, the second one returns 2, and the last one returns 1.

Let's see why:

If you take out the counts and retrieve the rows, we get:

(1)

SELECT DISTINCT 
    user_id
FROM 
    t
WHERE user_id NOT IN (SELECT DISTINCT user_id FROM t WHERE some_property = 1) ;

|---------|
| user_id |
|---------|

(that is: nothing. The count, in this case, is clearly 0)

(2)

SELECT DISTINCT user_id FROM t ;

|---------|
| user_id |
|---------|
|   NULL  |
|---------|
|    1    |
|---------|
|    2    |
|---------|

(there are 3 rows, but the count is 2, because you don't count NULL)

And last,

(3)

SELECT DISTINCT user_id FROM t where some_property = 1 ;

|---------|
| user_id |
|---------|
|    1    |
|---------|
|   NULL  |
|---------|

(There are 2 rows, but Count is 1, because, again, you don't count NULL).

You have to take into account that, according to SQL standard:

a NOT IN (x, y, z)                is equivalent to   
NOT (a IN (x, y, z))              which in turn, is equivalent to
NOT (a = x OR a = y OR a = z)

If a IS NULL, you get NOT (NULL = x OR NULL = y OR NULL = z) which is NULL, regardless of the values of x, y or z.

Related Solutions

Ms-access – “No Current Record” when using COUNT() in Access 2010

Okay, well, since nobody else chimed in and I needed to get it done, I took one last shot at it. Not quite sure on the why of it all, but I used basically the entire query as a subquery in the FROM clause, and then ran the report against that, and that seemed to do it...

SELECT
  [VendorCusts].[VendorPaid],
  [VendorCusts].[DE?],
  [VendorCusts].[IC?],
  [VendorCusts].[AG?],
  [VendorCusts].[GB?],
  COUNT(VendorCustID) AS [CustCount]
FROM
  (
    SELECT DISTINCT
      [Plan Revenue Expense].[Check To] AS [VendorPaid],
      [Support Provider].[DE?],
      [Support Provider].[IC?],
      [Support Provider].[Agency?] AS [AG?],
      [Support Provider].[GeneralBus?] AS [GB?],
      Customer.CustID AS [VendorCustID],
      Customer.LName AS [VendorCustLName],
      Customer.FName AS [VendorCustFName]
    FROM
      (((Customer
      INNER JOIN Plan ON Customer.CustID=Plan.CustID)
      INNER JOIN [Plan Revenue] ON [Plan].[Plan ID]=[Plan Revenue].[PlanID])
      INNER JOIN [Plan Revenue Expense] ON [Plan Revenue].[Rev ID]=[Plan Revenue Expense].[RevID])
      LEFT JOIN [Support Provider] ON [Plan Revenue Expense].[SP]=[Support Provider].[ID]
    WHERE
      (
        (
          ([Plan Revenue Expense].[First Day]>=[Expense Start "MM/DD/YYYY"])
          AND
          ([Plan Revenue Expense].[First Day]<=[Expense End "MM/DD/YYYY"])
        )
        OR
        (
          ([Plan Revenue Expense].[Last Day]>=[Expense Start "MM/DD/YYYY"])
          AND
          ([Plan Revenue Expense].[Last Day]<=[Expense End "MM/DD/YYYY"])
        )
      )
      AND NOT
      (
        [Plan Revenue].[Service]='111' OR
        [Plan Revenue].[Service]='222' OR
        [Plan Revenue].[Service] LIKE '333*'
      )
      AND NOT Customer.[Inactive?]=TRUE
    ORDER BY
      [Plan Revenue Expense].[Check To],
      Customer.LName,
      Customer.FName
  ) AS [VendorCusts]
GROUP BY
  [VendorCusts].[VendorPaid],
  [VendorCusts].[DE?],
  [VendorCusts].[IC?],
  [VendorCusts].[AG?],
  [VendorCusts].[GB?]
HAVING
  NOT ([VendorCusts].[DE?]=TRUE OR [VendorCusts].[IC?]=TRUE)
ORDER BY
  [VendorCusts].[DE?] ASC,
  [VendorCusts].[IC?] ASC,
  [VendorCusts].[AG?] ASC,
  [VendorCusts].[GB?] ASC,
  [VendorCusts].[VendorPaid] ASC;

This is more or less what I finally came up with and it appears to get the job done. Checked a few data points and they seemed to agree with the slightly more verbose version w/o a Count() function. So, I think the results are good.

Then I abstracted out the start and end of the date range so I could run a few different versions. And added a few other parameters I needed...

Seems like it's working as intended, if a bit kludge-y. Sometimes it's better to be 'right' than 'pretty.' ;)

Redshift DISTSTYLE ALL table takes less disk space

Well, after lots of experimentation, we have found that, yes, if you have a fat table with not enough rows to fill much of a block at the slice level using DISTSTYLE KEY, it is much more efficient, space-wise, to use DISTSTYLE ALL, even though this replicates more data.

For example, our fat tables have, say, 600000 rows and 1000 columns. Then, on a 5-node 32-slice setup, they were using up 1Mb per column per node per slice, or 160 Gb, and we were only filling less than 1% of that space.

DISTSTYLE ALL appears to not use all the slices.

So using DISTSTYLE ALL, we multiplied the data by 5, but this still takes up less than 1/20th of the space that the DISTSTYLE KEY tables did.

It slows down our queries a little (due to the need for IPC between the slices, and because the DISTSTYLE KEY means we have only the data on the node that matches our other data) but, since these tables are loaded once and never change after, we don't suffer from the other problems of DISTSTYLE ALL.

Best Answer

Related Solutions

Ms-access – “No Current Record” when using COUNT() in Access 2010

Redshift DISTSTYLE ALL table takes less disk space

Related Question