This question is a take-off from the excellent one posed here:
Cast to date is sargable but is it a good idea?
In my case, I am not concerned with the WHERE clause but with joining to an events table which has a column of type DATE. One table has DATETIME2 and the other has DATE… so I can effectively JOIN using a CAST( AS DATE), or I can use a "traditional" range query (>= date AND < date+1).
My question is which is preferable? The DATETIME2 values will almost never match the predicate DATE value.
I expect to stay on the order of 2M rows having the DATETIME2 and under 5k having the DATE (if this consideration makes a difference).
Should I expect the same behavior on the JOIN as I might using the WHERE clause? Which should I prefer to retain performance with scaling? Does the answer change with MSSQL 2012?
My generalized use-case is to treat my events table like a calendar table:
SELECT
events.columns
,SOME_AGGREGATIONS(tasks.column)
FROM
events
LEFT OUTER JOIN
tasks
--This states my intent more clearly
ON CAST(tasks.datetimecolumn AS DATE) = events.datecolumn
--But is this more effective/scalable?
--ON tasks.datetimecolumn >= events.datecolumn
--AND tasks.datetimecolumn < DATEADD(day,1,events.datecolumn)
GROUP BY
events.columns
Best Answer
"It depends".
One advantage of the = predicate and CAST to date is that the join can be hash or merge. The range version will force a nested loops plan. If there are no useful indexes to seek into the datetimecolumn on tasks, this would make a substantial difference.

Setting up the 5K / 2 million rows of test data mentioned in the question:
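A sketch of one way to build it; the table and column names come from the question, while the value distributions (400 tasks per day across 5,000 consecutive dates) are assumptions, and the original answer's script may differ:

CREATE TABLE events
(
    datecolumn date NOT NULL
);

CREATE TABLE tasks
(
    datetimecolumn datetime2 NOT NULL
);

-- 5,000 consecutive dates
WITH nums AS
(
    SELECT TOP (5000)
           CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS int) AS n
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
)
INSERT INTO events (datecolumn)
SELECT DATEADD(day, n, CAST('20000101' AS date))
FROM nums;

-- 2,000,000 datetimes: 400 per day, spread across those 5,000 dates
WITH nums AS
(
    SELECT TOP (2000000)
           CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS int) AS n
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
)
INSERT INTO tasks (datetimecolumn)
SELECT DATEADD(second, n % 86400, CAST(DATEADD(day, n / 400, '20000101') AS datetime2))
FROM nums;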
Then turning on statistics output.
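Presumably the standard switches, so the timings and reads below can be captured:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;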
And trying the CAST version.
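A sketch following the shape of the query in the question (the COUNT aggregate here stands in for the unspecified aggregations):

SELECT e.datecolumn,
       COUNT(t.datetimecolumn) AS tasks_per_day
FROM events AS e
LEFT OUTER JOIN tasks AS t
    ON CAST(t.datetimecolumn AS date) = e.datecolumn
GROUP BY e.datecolumn;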
This completed in 7.4 seconds. The estimated number of rows coming out of the join and into the GROUP BY was far too small (5006.27 vs actual 2,000,000) and the hash aggregate spilled to tempdb.
Trying the range predicate.
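That is, the same query with the range join condition from the question:

SELECT e.datecolumn,
       COUNT(t.datetimecolumn) AS tasks_per_day
FROM events AS e
LEFT OUTER JOIN tasks AS t
    ON  t.datetimecolumn >= e.datecolumn
    AND t.datetimecolumn <  DATEADD(day, 1, e.datecolumn)
GROUP BY e.datecolumn;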
The lack of an equality predicate forces a nested loops plan. As there are no useful indexes to support this query it has no option but to scan the 2 million row table 5,000 times.
On my machine that gave a parallel plan that eventually completed after 1 minute 40 seconds.
This time the number of rows coming out of the join and into the aggregate was grossly overestimated (estimated 124,939,000 vs actual 2,000,000).
Repeating the experiment after altering the tables to make the respective date/time columns the clustered primary key altered the results.
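Against the sketch schema above, that alteration might look like this (it assumes the generated values are unique; otherwise a non-unique clustered index would be needed instead):

ALTER TABLE events ADD CONSTRAINT PK_events PRIMARY KEY CLUSTERED (datecolumn);
ALTER TABLE tasks  ADD CONSTRAINT PK_tasks  PRIMARY KEY CLUSTERED (datetimecolumn);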
Both queries ended up choosing a nested loops plan. The CAST AS DATE version gave a serial plan that completed in 4.5 seconds, and the range version a parallel plan that completed in an elapsed time of 1.1 seconds with CPU time of 3.2 seconds. Applying MAXDOP 1 to the second query makes the figures more easily comparable.
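Capping parallelism is just a query hint appended to the statement, e.g.:

SELECT e.datecolumn,
       COUNT(t.datetimecolumn) AS tasks_per_day
FROM events AS e
LEFT OUTER JOIN tasks AS t
    ON  t.datetimecolumn >= e.datecolumn
    AND t.datetimecolumn <  DATEADD(day, 1, e.datecolumn)
GROUP BY e.datecolumn
OPTION (MAXDOP 1);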
Query 1 had an estimated 5006.73 rows coming out of the join, and the hash aggregate spilled to tempdb again. Query 2 again had a large overestimate (120,927,000 this time).
The other obvious difference between the two results is that the range query looks like it manages to seek into tasks more efficiently in some way, reading only 49,440 pages vs 78,137. The range that the CAST AS DATE version seeks into is derived from an internal function, GetRangeThroughConvert. The plan shows a residual predicate on CONVERT(date,[dbo].[tasks].[datetimecolumn],0) = [dbo].[events].[datecolumn].
If Query 2 is altered to widen its range accordingly, the number of reads becomes the same.
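The exact rewrite isn't reproduced here; a hypothetical version along those lines, reading the same two days' worth of rows as the dynamic seek:

SELECT e.datecolumn,
       COUNT(t.datetimecolumn) AS tasks_per_day
FROM events AS e
LEFT OUTER JOIN tasks AS t
    -- hypothetical widened bounds approximating GetRangeThroughConvert's
    -- two-day range; for comparing reads only, not for correct results
    ON  t.datetimecolumn >  DATEADD(day, -1, e.datecolumn)
    AND t.datetimecolumn <  DATEADD(day,  1, e.datecolumn)
GROUP BY e.datecolumn
OPTION (MAXDOP 1);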
The dynamic seek used by the CAST AS DATE version reads unnecessary rows (two days' worth rather than one) and then discards them with the residual predicate.

One other possibility would be to restructure the table to store the date and time components in different columns. The datetimecolumn can be derived from the component parts, and this has no effect on row size, as the width of date + time(n) is the same as the width of datetime2(n) (with an exception being if the additional column increases the size of the NULL_BITMAP). The query is then a straightforward = predicate.
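A sketch of that restructuring and the resulting join; the split column names are invented here, and the datetime2 value is recomputed from the parts:

CREATE TABLE tasks2
(
    taskdate date    NOT NULL,
    tasktime time(7) NOT NULL,
    -- original datetime2 derived from the component parts
    datetimecolumn AS DATEADD(day,
                              DATEDIFF(day, '19000101', taskdate),
                              CAST(tasktime AS datetime2(7)))
);

CREATE CLUSTERED INDEX IX_tasks2 ON tasks2 (taskdate, tasktime);

SELECT e.datecolumn,
       COUNT(t.tasktime) AS tasks_per_day
FROM events AS e
LEFT OUTER JOIN tasks2 AS t
    ON t.taskdate = e.datecolumn
GROUP BY e.datecolumn;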
This would allow a merge join between the tables without any need to sort, though for these table sizes a nested loops join was chosen anyway.
As well as potentially allowing different logical join types, storing the date separately as the leading index column would also potentially benefit other queries on tasks, such as grouping by date.

As for why the = predicate shows fewer logical reads on tasks than the > <= version with the same nested loops plan (44,285 vs 49,440), this appears to be related to the read-ahead mechanism. Turning on trace flag 652 (which disables read-ahead) reduces the logical reads of the range version to the same as that of the equals version.
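For completeness, the flag can be toggled at the session level for a quick test like this (disabling read-ahead is a diagnostic step only, not a production setting):

DBCC TRACEON (652);
-- re-run the range-predicate query and compare the STATISTICS IO output
DBCC TRACEOFF (652);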