SQL Server – Should I Join Datetime to a Date Using CAST or Range?

datetime, sql-server-2008-r2, sql-server-2012, t-sql

This question is a follow-up to the excellent one posed here:

Cast to date is sargable but is it a good idea?

In my case, I am not concerned with the WHERE clause but with joining to an events table which has a column of type DATE.

One table has DATETIME2 and the other has DATE, so I can either JOIN using CAST(datetimecolumn AS DATE) or use a "traditional" range query (>= date AND < date + 1).

My question is which is preferable? The DATETIME values will almost never match the predicate DATE value.

I expect to stay on the order of 2M rows on the DATETIME side and under 5k on the DATE side (if this consideration makes a difference).

Should I expect the same behavior on the JOIN as I might using the WHERE clause? Which should I prefer to retain performance with scaling? Does the answer change with MSSQL 2012?

My generalized use case is to treat my events table like a calendar table:

SELECT
    events.columns
    ,SOME_AGGREGATIONS(tasks.column)
FROM
    events
LEFT OUTER JOIN
    tasks
        --This states my intent more clearly
        ON CAST(tasks.datetimecolumn AS DATE) = events.datecolumn 
        --But is this more effective/scalable?
        --ON tasks.datetimecolumn >= events.datecolumn 
        --AND tasks.datetimecolumn < DATEADD(day,1,events.datecolumn)
GROUP BY
    events.columns

Best Answer

"It depends".

One advantage of the = predicate with the cast to date is that the join can be a hash or merge join. The range version forces a nested loops plan.

If there are no useful indexes to seek into datetimecolumn on tasks, this would make a substantial difference.
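
As a quick illustration (a sketch using the test tables set up below, not part of the original measurements): forcing the join type with a hint shows the difference, because hash and merge joins need at least one equality predicate.

-- The CAST version can be forced to a hash join.
SELECT events.eventId,
       MAX(tasks.details)
FROM   events
       LEFT OUTER JOIN tasks
         ON CAST(tasks.datetimecolumn AS DATE) = events.datecolumn
GROUP  BY events.eventId
OPTION (HASH JOIN);

-- The same hint on the range version typically fails with error 8622
-- ("Query processor could not produce a query plan because of the hints
-- defined in this query"), since there is no equality predicate to hash on.
SELECT events.eventId,
       MAX(tasks.details)
FROM   events
       LEFT OUTER JOIN tasks
         ON tasks.datetimecolumn >= events.datecolumn
            AND tasks.datetimecolumn < DATEADD(day, 1, events.datecolumn)
GROUP  BY events.eventId
OPTION (HASH JOIN);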

Setting up the 5K / 2 million rows of test data mentioned in the question

CREATE TABLE events
  (
     eventId    INT IDENTITY PRIMARY KEY,
     datecolumn DATE NOT NULL,
     details    CHAR(1000) DEFAULT 'D'
  )

INSERT INTO events
            (datecolumn)
SELECT TOP 5000 DATEADD(DAY, ROW_NUMBER() OVER (ORDER BY @@SPID), GETDATE())
FROM   master.dbo.spt_values v1,
       master.dbo.spt_values v2

CREATE TABLE tasks
  (
     taskId         INT IDENTITY PRIMARY KEY,
     datetimecolumn DATETIME2 NOT NULL,
     details        CHAR(1000) DEFAULT 'D'
  );

WITH N
     AS (SELECT number
         FROM   master.dbo.spt_values
         WHERE  number BETWEEN 1 AND 400
                AND type = 'P')
INSERT INTO tasks
            (datetimecolumn)
SELECT DATEADD(MINUTE, number, CAST(datecolumn AS DATETIME2))
FROM   events,
       N

Then turning on

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

And trying the CAST version

SELECT events.eventId,
       MAX(tasks.details)
FROM   events
       LEFT OUTER JOIN tasks
         ON CAST(tasks.datetimecolumn AS DATE) = events.datecolumn
GROUP  BY events.eventId

Completed in 7.4 seconds

Table 'Worktable'. Scan count 0, logical reads 0
Table 'tasks'. Scan count 1, logical reads 28679
Table 'events'. Scan count 1, logical reads 719

   CPU time = 3042 ms,  elapsed time = 7434 ms.

The estimated number of rows coming out of the join and into the GROUP BY was far too small (5006.27 estimated vs 2,000,000 actual), and the hash aggregate spilled to tempdb.

Plan 1
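
(As a side note, not part of the original timings: one way to confirm the estimate and the spill is to capture the actual plan, e.g. by wrapping the run in SET STATISTICS XML, and compare the estimated vs actual rows on the hash aggregate and check its spill warning.)

SET STATISTICS XML ON;

SELECT events.eventId,
       MAX(tasks.details)
FROM   events
       LEFT OUTER JOIN tasks
         ON CAST(tasks.datetimecolumn AS DATE) = events.datecolumn
GROUP  BY events.eventId;

SET STATISTICS XML OFF;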

Trying the range predicate

SELECT events.eventId,
       MAX(tasks.details)
FROM   events
       LEFT OUTER JOIN tasks
         ON tasks.datetimecolumn >= events.datecolumn
            AND tasks.datetimecolumn < DATEADD(day, 1, events.datecolumn)
GROUP  BY events.eventId 

The lack of an equality predicate forces a nested loops plan. As there are no useful indexes to support this query, it has no option but to scan the 2 million row table 5,000 times.

On my machine this gave a parallel plan that eventually completed after 1 minute 40 seconds.

Table 'tasks'. Scan count 4, logical reads 143390000
Table 'events'. Scan count 5, logical reads 788
Table 'Worktable'. Scan count 0, logical reads 0
  CPU time = 368193 ms,  elapsed time = 100528 ms.

This time the number of rows coming out of the join and into the aggregate was grossly overestimated (124,939,000 estimated vs 2,000,000 actual).

Plan 2

Repeating the experiment after changing the tables so that the respective date/time columns are the clustered primary key altered the results.
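
The original answer doesn't show the exact DDL for this step; one way to reproduce the altered setup (a sketch, relying on the fact that both columns happen to be unique in this test data) is to rebuild the tables with the date/time column as the clustered primary key and repopulate them with the same INSERT statements as above.

DROP TABLE tasks;
DROP TABLE events;

CREATE TABLE events
  (
     eventId    INT IDENTITY NOT NULL,
     datecolumn DATE NOT NULL PRIMARY KEY CLUSTERED,
     details    CHAR(1000) DEFAULT 'D'
  );

CREATE TABLE tasks
  (
     taskId         INT IDENTITY NOT NULL,
     datetimecolumn DATETIME2 NOT NULL PRIMARY KEY CLUSTERED,
     details        CHAR(1000) DEFAULT 'D'
  );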

Both queries ended up choosing a nested loops plan. The CAST AS DATE version gave a serial plan that completed in 4.5 seconds, and the range version a parallel plan that completed in 1.1 seconds elapsed with 3.2 seconds of CPU time.

Applying MAXDOP 1 to the second query to make the figures more easily comparable returns the following.
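
(For reference, "applying MAXDOP 1" here just means adding a query-level hint to the range query:)

SELECT events.eventId,
       MAX(tasks.details)
FROM   events
       LEFT OUTER JOIN tasks
         ON tasks.datetimecolumn >= events.datecolumn
            AND tasks.datetimecolumn < DATEADD(day, 1, events.datecolumn)
GROUP  BY events.eventId
OPTION (MAXDOP 1);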

Query 1

Table 'Worktable'. Scan count 0, logical reads 0
Table 'tasks'. Scan count 5000, logical reads 78137
Table 'events'. Scan count 1, logical reads 719
   CPU time = 3167 ms,  elapsed time = 4497 ms.

Query 2

Table 'tasks'. Scan count 5000, logical reads 49440
Table 'events'. Scan count 1, logical reads 719
   CPU time = 3042 ms,  elapsed time = 3147 ms.

Query 1 had an estimated 5006.73 rows coming out of the join, and the hash aggregate spilled to tempdb again.

Query 2 again had a large overestimate (120,927,000 this time).

The other obvious difference between the two results is that the range query seems to seek into tasks more efficiently, reading only 49,440 pages vs 78,137.

The range that the cast as date version seeks into is derived from an internal function GetRangeThroughConvert. The plan shows a residual predicate on CONVERT(date,[dbo].[tasks].[datetimecolumn],0)= [dbo].[events].[datecolumn].

If Query 2 is altered to

   LEFT OUTER JOIN tasks
     ON tasks.datetimecolumn > DATEADD(day, -1, events.datecolumn)
        AND tasks.datetimecolumn < DATEADD(day, 1, events.datecolumn)

then the number of reads becomes the same. The dynamic seek used by the CAST AS DATE version reads unnecessary rows (two days' worth rather than one) and then discards them with the residual predicate.

One other possibility would be to restructure the table to store the date and time components in different columns.

CREATE TABLE [dbo].[tasks](
    [taskId] [int] IDENTITY(1,1) NOT NULL,
    [datecolumn] date NOT NULL,
    [timecolumn] time NOT NULL,
    [datetimecolumn]  AS DATEADD(day, DATEDIFF(DAY,0,[datecolumn]), CAST([timecolumn] AS DATETIME2(7))),
    [details] [char](1000) NULL,
PRIMARY KEY CLUSTERED 
(
    [datecolumn] ASC,
    [timecolumn] ASC
))

The datetimecolumn can be derived from the component parts, and this has no effect on row size (the width of date + time(n) is the same as the width of datetime2(n)), with the exception of cases where the additional column increases the size of the NULL_BITMAP.
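
Populating the restructured table just means splitting the existing values into their date and time parts. A sketch (tasks_old is a hypothetical name for the previous single-datetime2-column table, e.g. renamed before the rebuild):

INSERT INTO tasks
            (datecolumn, timecolumn, details)
SELECT CAST(datetimecolumn AS DATE),
       CAST(datetimecolumn AS TIME(7)),
       details
FROM   tasks_old;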

The query then uses a straightforward = predicate

SELECT events.eventId,
       MAX(tasks.details)
FROM   events
       LEFT OUTER JOIN tasks
         ON tasks.datecolumn = events.datecolumn
GROUP  BY events.eventId

This would allow a merge join between the tables without any need to sort, though for these table sizes a nested loops join was chosen anyway, with stats as below.

Table 'tasks'. Scan count 5000, logical reads 44285
Table 'events'. Scan count 1, logical reads 717
   CPU time = 2980 ms,  elapsed time = 3012 ms.
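
(To see the merge join that the new key order makes possible, the join type can be forced with a hint; this is a sketch rather than part of the original measurements. With both tables clustered on the date column, both inputs already arrive in datecolumn order, so no sort should be needed.)

SELECT events.eventId,
       MAX(tasks.details)
FROM   events
       LEFT OUTER JOIN tasks
         ON tasks.datecolumn = events.datecolumn
GROUP  BY events.eventId
OPTION (MERGE JOIN);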

As well as potentially allowing different physical join types, storing the date separately as the leading index column could also benefit other queries on tasks, such as grouping by date.
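
For example (a hypothetical follow-on query, not from the original answer), a per-day count can be satisfied by an ordered scan of the clustered index with a stream aggregate and no sort:

SELECT datecolumn,
       COUNT(*) AS tasksPerDay
FROM   tasks
GROUP  BY datecolumn;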

As for why the = predicate shows fewer logical reads on tasks than the >= / < range version with the same nested loops plan (44,285 vs 49,440), this appears to be related to the read-ahead mechanism.

Turning on trace flag 652 reduces the logical reads of the range version to the same as that of the equals version.
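
(For completeness, trace flag 652 disables read-ahead scans. A session-level sketch, for testing only, would look something like this:)

-- Testing only: turn off read-ahead for this session, re-run the
-- range-predicate query, then turn read-ahead back on.
DBCC TRACEON (652);

-- ...re-run the range query here and compare SET STATISTICS IO output...

DBCC TRACEOFF (652);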