SQL Server – How to efficiently check EXISTS on multiple columns

Tags: performance, sql-server

This is an issue I come up against periodically and have not yet found a good solution for.

Suppose the following table structure:

CREATE TABLE T
(
A INT PRIMARY KEY,
B CHAR(1000) NULL,
C CHAR(1000) NULL
)

and the requirement is to determine whether either of the nullable columns B or C actually contains any NULL values (and, if so, which one(s)).

Also assume the table contains millions of rows (and that no column statistics are available that could be peeked at as I am interested in a more generic solution for this class of queries).
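(Purely to make the setup concrete, a hypothetical way of filling such a table might look like the following; this is not the test data used for the numbers further down, and the row count and NULL placement are arbitrary.)

-- Hypothetical test data only: a million wide rows, then seed a NULL
INSERT INTO T (A, B, C)
SELECT TOP (1000000)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 'b', 'c'
FROM sys.all_objects o1
CROSS JOIN sys.all_objects o2;

UPDATE T SET B = NULL WHERE A = 123456; -- optionally introduce a NULL in B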

I can think of a few ways of approaching this but all have weaknesses.

Two separate EXISTS statements (sketched below). This would have the advantage of allowing each query to stop scanning as soon as a NULL is found. But if both columns in fact contain no NULLs then two full scans will result.
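A minimal sketch of what I mean by that (the exact form is not important, only that each probe can stop at the first NULL it finds):

IF EXISTS (SELECT * FROM T WHERE B IS NULL)
    PRINT 'B contains NULLs';

IF EXISTS (SELECT * FROM T WHERE C IS NULL)
    PRINT 'C contains NULLs';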

Single Aggregate Query

SELECT 
    MAX(CASE WHEN B IS NULL THEN 1 ELSE 0 END) AS B,
    MAX(CASE WHEN C IS NULL THEN 1 ELSE 0 END) AS C
FROM T

This could process both columns at the same time, so the worst case is one full scan.
The disadvantage is that even if it encounters a NULL in both columns very early on, the query will still end up scanning the whole of the rest of the table.

User variables

I can think of a third way of doing this

BEGIN TRY
DECLARE @B INT, @C INT, @D INT

SELECT 
    @B = CASE WHEN B IS NULL THEN 1 ELSE @B END,
    @C = CASE WHEN C IS NULL THEN 1 ELSE @C END,
    /*Divide by zero error if both @B and @C are 1.
    Might happen next row as no guarantee of order of
    assignments*/
    @D = 1 / (2 - (@B + @C))
FROM T  
OPTION (MAXDOP 1)       
END TRY
BEGIN CATCH
IF ERROR_NUMBER() = 8134 /*Divide by zero*/
    BEGIN
    SELECT 'B,C both contain NULLs'
    RETURN;
    END
ELSE
    RETURN;
END CATCH

SELECT ISNULL(@B,0),
       ISNULL(@C,0)

but this is not suitable for production code, as the correct behavior for an aggregate concatenation query is undefined, and terminating the scan by throwing an error is quite a horrible solution anyway.

Is there another option that combines the strengths of the approaches above?

Edit

Just to update this with the results I get in terms of reads for the answers submitted so far (using @ypercube's test data)

+----------+------------+------+---------+----------+----------------------+----------+------------------+
|          | 2 * EXISTS | CASE | Kejser  |  Kejser  |        Kejser        | ypercube |       8kb        |
|          |            |      |         | MAXDOP 1 | HASH GROUP, MAXDOP 1 |          |                  |
+----------+------------+------+---------+----------+----------------------+----------+------------------+
| No Nulls |      15208 | 7604 |    8343 |     7604 |                 7604 |    15208 |    8346 (8343+3) |
| One Null |       7613 | 7604 |    8343 |     7604 |                 7604 |     7620 | 7630 (25+7602+3) |
| Two Null |         23 | 7604 |    8343 |     7604 |                 7604 |       30 |       30 (18+12) |
+----------+------------+------+---------+----------+----------------------+----------+------------------+

For @Thomas Kejser's answer I changed TOP 3 to TOP 2 to potentially allow it to exit earlier. I got a parallel plan by default for that answer, so I also tried it with a MAXDOP 1 hint in order to make the number of reads more comparable to the other plans. I was somewhat surprised by the results, as in my earlier test I had seen that query short circuit without reading the whole table.
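For reference, the variant I measured is roughly the following (a reconstruction: Thomas's query from the answer below, with TOP 3 changed to TOP 2 and, for the serial runs, the MAXDOP 1 hint added):

SELECT TOP 2 *
FROM (SELECT DISTINCT 
             CASE WHEN B IS NULL THEN NULL ELSE 'foo' END AS B,
             CASE WHEN C IS NULL THEN NULL ELSE 'bar' END AS C
      FROM T 
      WHERE (B IS NULL AND C IS NOT NULL) 
         OR (B IS NOT NULL AND C IS NULL) 
         OR (B IS NULL AND C IS NULL)
     ) AS DT
OPTION (MAXDOP 1)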

The plan for my test data that short circuits is below

[Execution plan: short circuits]

The plan for ypercube's data is

[Execution plan: does not short circuit]

So it adds a blocking Sort operator to the plan. I also tried with the HASH GROUP hint, but that still ends up reading all the rows.

[Execution plan: does not short circuit]

So the key seems to be to get a Hash Match (Flow Distinct) operator to allow this plan to short circuit, as the other alternatives will block and consume all rows anyway. I don't think there is a hint to force this specifically, but apparently "in general, the optimiser chooses a Flow Distinct where it determines that fewer output rows are required than there are distinct values in the input set."

@ypercube's data only has 1 row in each column with NULL values (table cardinality = 30300), and the estimated rows going into and out of the operator are both 1. By making the predicate a bit more opaque to the optimiser (LEFT(b,1) + LEFT(c,1) IS NULL is still equivalent to b IS NULL OR c IS NULL, since concatenating a NULL yields NULL under the default settings), the optimiser generated a plan with the Flow Distinct operator.

SELECT TOP 2 *
FROM (SELECT DISTINCT 
             CASE WHEN b IS NULL THEN NULL ELSE 'foo' END AS b,
             CASE WHEN c IS NULL THEN NULL ELSE 'bar' END AS c
      FROM test T 
      WHERE LEFT(b,1) + LEFT(c,1) IS NULL
     ) AS DT 

Edit 2

One last tweak that occurred to me is that the query above could still end up processing more rows than necessary in the event that the first row it encounters with a NULL has NULLs in both columns B and C. It will continue scanning rather than exiting immediately. One way of avoiding this would be to unpivot the rows as they are scanned. So my final amendment to Thomas Kejser's answer is below:

SELECT DISTINCT TOP 2 NullExists
FROM test T 
CROSS APPLY (VALUES(CASE WHEN b IS NULL THEN 'b' END),
                   (CASE WHEN c IS NULL THEN 'c' END)) V(NullExists)
WHERE NullExists IS NOT NULL

It would probably be better for the predicate to be WHERE (b IS NULL OR c IS NULL) AND NullExists IS NOT NULL, but against the previous test data that one doesn't give me a plan with a Flow Distinct, whereas the NullExists IS NOT NULL one does (plan below).

[Execution plan: unpivoted version]
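For completeness, the variant with the extra guard predicate (the one that does not get a Flow Distinct against this test data) would be:

SELECT DISTINCT TOP 2 NullExists
FROM test T 
CROSS APPLY (VALUES(CASE WHEN b IS NULL THEN 'b' END),
                   (CASE WHEN c IS NULL THEN 'c' END)) V(NullExists)
WHERE (b IS NULL OR c IS NULL) AND NullExists IS NOT NULL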

Best Answer

How about:

SELECT TOP 3 *
FROM (SELECT DISTINCT 
             CASE WHEN B IS NULL THEN NULL ELSE 'foo' END AS B,
             CASE WHEN C IS NULL THEN NULL ELSE 'bar' END AS C
      FROM T 
      WHERE (B IS NULL AND C IS NOT NULL) 
         OR (B IS NOT NULL AND C IS NULL) 
         OR (B IS NULL AND C IS NULL)
     ) AS DT
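(Not part of the original answer, just a usage note: a row in DT whose B is NULL means column B contains at least one NULL, and likewise for C. If two simple flags are more convenient, one way to wrap it might be the following sketch.)

SELECT ISNULL(MAX(CASE WHEN B IS NULL THEN 1 ELSE 0 END), 0) AS BHasNull,
       ISNULL(MAX(CASE WHEN C IS NULL THEN 1 ELSE 0 END), 0) AS CHasNull
FROM (SELECT TOP 3 *
      FROM (SELECT DISTINCT 
                   CASE WHEN B IS NULL THEN NULL ELSE 'foo' END AS B,
                   CASE WHEN C IS NULL THEN NULL ELSE 'bar' END AS C
            FROM T 
            WHERE (B IS NULL AND C IS NOT NULL) 
               OR (B IS NOT NULL AND C IS NULL) 
               OR (B IS NULL AND C IS NULL)
           ) AS DT
     ) AS R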