I've often read that checking for the existence of a row should always be done with EXISTS rather than with a COUNT. Yet in several recent scenarios I've measured a performance improvement when using COUNT.
The pattern goes like this:
LEFT JOIN (
SELECT
someID
, COUNT(*) AS n
FROM someTable
GROUP BY someID
) AS Alias ON (
Alias.someID = mainTable.ID
)
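For comparison, the EXISTS version of the same check is the correlated form (same hypothetical table and column names as the COUNT pattern above):

```sql
-- EXISTS form of the same existence check, using the same hypothetical
-- names (someTable, someID, mainTable) as the COUNT pattern above.
SELECT
    mainTable.ID
  , CASE WHEN EXISTS (
        SELECT 1
        FROM someTable AS Alias
        WHERE Alias.someID = mainTable.ID
    ) THEN 1 ELSE 0 END AS DoesExist
FROM mainTable
```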
I'm not familiar with ways to see what's happening "inside" SQL Server, so I was wondering whether there is some unheralded flaw in EXISTS that would make perfect sense of the measurements I've made (could EXISTS be RBAR?!).
Do you have an explanation for this phenomenon?
EDIT:
Here's a full script you can run:
SET NOCOUNT ON
SET STATISTICS IO OFF
DECLARE @tmp1 TABLE (
ID INT UNIQUE
)
DECLARE @tmp2 TABLE (
ID INT
, X INT IDENTITY
, UNIQUE (ID, X)
)
; WITH T(n) AS (
SELECT
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master.dbo.spt_values AS S
)
, tally(n) AS (
SELECT
T2.n * 100 + T1.n
FROM T AS T1
CROSS JOIN T AS T2
WHERE T1.n <= 100
AND T2.n <= 100
)
INSERT @tmp1
SELECT n
FROM tally AS T1
WHERE n < 10000
; WITH T(n) AS (
SELECT
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master.dbo.spt_values AS S
)
, tally(n) AS (
SELECT
T2.n * 100 + T1.n
FROM T AS T1
CROSS JOIN T AS T2
WHERE T1.n <= 100
AND T2.n <= 100
)
INSERT @tmp2
SELECT T1.n
FROM tally AS T1
CROSS JOIN T AS T2
WHERE T1.n < 10000
AND T1.n % 3 <> 0
AND T2.n < 1 + T1.n % 15
PRINT '
COUNT Version:
'
WAITFOR DELAY '00:00:01'
SET STATISTICS IO ON
SET STATISTICS TIME ON
SELECT
T1.ID
, CASE WHEN n > 0 THEN 1 ELSE 0 END AS DoesExist
FROM @tmp1 AS T1
LEFT JOIN (
SELECT
T2.ID
, COUNT(*) AS n
FROM @tmp2 AS T2
GROUP BY T2.ID
) AS T2 ON (
T2.ID = T1.ID
)
WHERE T1.ID BETWEEN 5000 AND 7000
OPTION (RECOMPILE) -- Required since the tables are filled within the same scope
SET STATISTICS TIME OFF
PRINT '
EXISTS Version:'
WAITFOR DELAY '00:00:01'
SET STATISTICS TIME ON
SELECT
T1.ID
, CASE WHEN EXISTS (
SELECT 1
FROM @tmp2 AS T2
WHERE T2.ID = T1.ID
) THEN 1 ELSE 0 END AS DoesExist
FROM @tmp1 AS T1
WHERE T1.ID BETWEEN 5000 AND 7000
OPTION (RECOMPILE) -- Required since the tables are filled within the same scope
SET STATISTICS TIME OFF
On SQL Server 2008 R2 (Windows 7, 64-bit) I get these results:

COUNT Version:

Table '#455F344D'. Scan count 1, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#492FC531'. Scan count 1, logical reads 30, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 81 ms.

EXISTS Version:

Table '#492FC531'. Scan count 1, logical reads 96, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#455F344D'. Scan count 1, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 76 ms.
Best Answer
It's very rare for anything to always be true, especially when it comes to databases. There are any number of ways to express the same semantics in SQL. If there is a useful rule of thumb, it might be to write queries using the most natural syntax available (and, yes, that is subjective) and only consider rewrites if the query plan or performance you get is unacceptable.
For what it's worth, my own take on the issue is that existence queries are most naturally expressed using EXISTS. It has also been my experience that EXISTS tends to optimize better than the OUTER JOIN reject-NULL alternative. Using COUNT(*) and filtering on = 0 is another alternative that happens to have some support in the SQL Server query optimizer, but I have personally found it to be unreliable in more complex queries. In any case, EXISTS just seems much more natural (to me) than either of those alternatives.

Your particular example is interesting because it highlights the way the optimizer deals with subqueries in CASE expressions (and EXISTS tests in particular).

Subqueries in CASE expressions
Consider the following (perfectly legal) query:
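(The example query itself is not preserved in this extract; the following is a reconstruction, reusing the table variables from the script above. The WHEN clause is always satisfied, so the ELSE subquery must never be evaluated even though it is present in the plan.)

```sql
-- Hypothetical reconstruction of the example query. The WHEN clause is
-- always satisfied, so the ELSE subquery, which would fail at run time
-- with "Subquery returned more than 1 value" if it were evaluated, must
-- be skipped.
SELECT
    CASE
        WHEN (SELECT 1) = 1
            THEN (SELECT COUNT_BIG(*) FROM @tmp1)
        ELSE (SELECT T2.ID FROM @tmp2 AS T2)
    END
```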
The semantics of CASE are that WHEN/ELSE clauses are generally evaluated in textual order. In the query above, it would be incorrect for SQL Server to return an error because the ELSE subquery returned more than one row, if the WHEN clause was satisfied. To respect these semantics, the optimizer produces a plan that uses pass-through predicates: the inner side of each nested loops join is evaluated only when the pass-through predicate returns false. The overall effect is that CASE expressions are tested in order, and subqueries are only evaluated if no previous expression was satisfied.

CASE expressions with an EXISTS subquery
Where a CASE subquery uses EXISTS, the logical existence test is implemented as a semi-join, but rows that would normally be rejected by the semi-join have to be retained in case a later clause needs them. Rows flowing through this special kind of semi-join acquire a flag to indicate whether the semi-join found a match or not. This flag is known as the probe column.

In implementation terms, the logical subquery is replaced by a correlated join ('apply') with a probe column. The work is performed by a simplification rule in the query optimizer called RemoveSubqInPrj (remove subquery in projection). The details can be seen with trace flag 8606, which shows the part of the input tree containing the EXISTS test being transformed by RemoveSubqInPrj into a structure headed by the left semi-join apply with probe described previously. This initial transformation is the only one available in SQL Server query optimizers to date, and compilation will simply fail if this transformation is disabled.
One of the possible execution plan shapes for this query is a direct implementation of that logical structure: the final Compute Scalar evaluates the result of the CASE expression using the probe column value. The basic shape of the plan tree is preserved when the optimizer considers other physical join types for the semi-join. Only merge join supports a probe column, so a hash semi-join, though logically possible, is not considered.
In the merge plan, notice that the merge join outputs an expression labelled Expr1008 (that the name is the same as before is a coincidence) even though no definition for it appears on any operator in the plan. This is just the probe column again. As before, the final Compute Scalar uses this probe value to evaluate the CASE.

The problem is that the optimizer doesn't fully explore alternatives that only become worthwhile with a merge (or hash) semi-join. In the nested loops plan, there is no advantage to checking on every iteration whether rows in T2 match the range. With a merge or hash plan, this could be a useful optimization.

If we add a matching BETWEEN predicate on T2 to the query, all that happens is that the check is performed for each row as a residual predicate on the merge semi-join (tough to spot in the execution plan, but it is there).

We would hope that the BETWEEN predicate would instead be pushed down to T2, resulting in a seek. Normally, the optimizer would consider doing this even without the extra predicate in the query: it recognizes implied predicates (the BETWEEN on T1 and the join predicate between T1 and T2 together imply the BETWEEN on T2) without them being present in the original query text. Unfortunately, the apply-probe pattern means this alternative is not explored.

There are ways to write the query to produce seeks on both inputs to a merge semi-join. One way involves writing the query in quite an unnatural way (defeating the reason I generally prefer EXISTS in the first place).

I wouldn't be happy writing such a query in a production environment; it's just there to demonstrate that the desired plan shape is possible. If the real query you need to write uses CASE in this particular way, and performance suffers because there is no seek on the probe side of a merge semi-join, you might consider writing the query with a different syntax that produces the correct results and a more efficient execution plan.
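The rewritten query itself is not preserved in this extract. One possible shape (a sketch, not necessarily the original rewrite) states the range explicitly on the @tmp2 side of an outer join, so both inputs can be seeked:

```sql
-- Sketch only: the original rewrite is not preserved in this extract.
-- Stating the range explicitly inside the derived table lets both
-- inputs be seeked; DISTINCT keeps one row per ID, preserving the
-- existence semantics of the original query.
SELECT
    T1.ID
  , CASE WHEN T2.ID IS NOT NULL THEN 1 ELSE 0 END AS DoesExist
FROM @tmp1 AS T1
LEFT JOIN (
    SELECT DISTINCT
        T2.ID
    FROM @tmp2 AS T2
    WHERE T2.ID BETWEEN 5000 AND 7000
) AS T2 ON (
    T2.ID = T1.ID
)
WHERE T1.ID BETWEEN 5000 AND 7000
OPTION (RECOMPILE)
```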