Sql-server – SARGable WHERE clause for two date columns

index, performance, query-performance, sql-server

I have what is, to me, an interesting question on SARGability. In this case, it's about using a predicate on the difference between two date columns. Here's the setup:

USE [tempdb]
SET NOCOUNT ON  

IF OBJECT_ID('tempdb..#sargme') IS NOT NULL
BEGIN
DROP TABLE #sargme
END

SELECT TOP 1000
IDENTITY (BIGINT, 1,1) AS ID,
CAST(DATEADD(DAY, [m].[severity] * -1, GETDATE()) AS DATE) AS [DateCol1],
CAST(DATEADD(DAY, [m].[severity], GETDATE()) AS DATE) AS [DateCol2]
INTO #sargme
FROM sys.[messages] AS [m]

ALTER TABLE [#sargme] ADD CONSTRAINT [pk_whatever] PRIMARY KEY CLUSTERED ([ID])
CREATE NONCLUSTERED INDEX [ix_dates] ON [#sargme] ([DateCol1], [DateCol2])

What I see pretty frequently is something like this:

/*definitely not sargable*/
SELECT
    * ,
    DATEDIFF(DAY, [s].[DateCol1], [s].[DateCol2])
FROM
    [#sargme] AS [s]
WHERE
    DATEDIFF(DAY, [s].[DateCol1], [s].[DateCol2]) >= 48;

…which definitely isn't SARGable. It results in an index scan, reads all 1000 rows, no good. Estimated rows stink. You'd never put this in production.

It would be nice if we could materialize CTEs, because that would help us make this, well, more SARGable-er, technically speaking. But no, we get the same execution plan as up top.

/*would be nice if it were sargable*/
WITH    [x] AS ( SELECT
                * ,
                DATEDIFF(DAY, [s].[DateCol1], [s].[DateCol2]) AS [ddif]
               FROM
                [#sargme] AS [s])
     SELECT
        *
     FROM
        [x]
     WHERE
        [x].[ddif] >= 48;

And of course, since we are not using constants, this code changes nothing, and is not even half SARGable. No fun. Same execution plan.

/*not even half sargable*/
SELECT
    * ,
    DATEDIFF(DAY, [s].[DateCol1], [s].[DateCol2])
FROM
    [#sargme] AS [s]
WHERE
    [s].[DateCol2] >= DATEADD(DAY, 48, [s].[DateCol1])
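
For contrast, here is a minimal sketch of a similar predicate that is SARGable because the column is compared to a value that is constant for the query. The variable @cutoff and the 10-day offset are hypothetical and not part of the original setup. This answers a different question (rows on or after a fixed date), so it only illustrates why the column-to-column comparison above cannot seek.

```sql
-- @cutoff is a hypothetical value, used only for illustration
DECLARE @cutoff DATE = CAST(GETDATE() AS DATE);

SELECT
    [s].[ID],
    [s].[DateCol1],
    [s].[DateCol2]
FROM
    [#sargme] AS [s]
WHERE
    -- DateCol1 is the leading key of ix_dates and is compared to an
    -- expression that does not depend on any column of the row
    [s].[DateCol1] >= DATEADD(DAY, -10, @cutoff);
```

A predicate of this shape should be eligible for a seek on ix_dates, because the right-hand side can be evaluated once rather than per row.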

If you're feeling lucky, and you're obeying all the ANSI SET options in your connection strings, you could add a computed column, and search on it…

ALTER TABLE [#sargme] ADD [ddiff] AS 
DATEDIFF(DAY, DateCol1, DateCol2) PERSISTED

CREATE NONCLUSTERED INDEX [ix_dates2] ON [#sargme] ([ddiff], [DateCol1], [DateCol2])

SELECT [s].[ID] ,
       [s].[DateCol1] ,
       [s].[DateCol2]
FROM [#sargme] AS [s]
WHERE [ddiff] >= 48

This gets you an index seek in three of the queries. The odd one out is the query that adds 48 days to DateCol1. The query with DATEDIFF in the WHERE clause, the CTE, and the final query with a predicate on the computed column all get a much nicer plan with much nicer estimates, and all that.

Which brings me to the question: in a single query, is there a SARGable way to perform this search?

No temp tables, no table variables, no altering the table structure, and no views.

I'm fine with self-joins, CTEs, subqueries, or multiple passes over the data. Can work with any version of SQL Server.

Avoiding the computed column is an artificial limitation because I'm more interested in a query solution than anything else.

Best Answer

Just adding this quickly so it exists as an answer (though I know it's not the answer you want).

An indexed computed column is usually the right solution for this type of problem.

It:

makes the predicate an indexable expression
allows automatic statistics to be created for better cardinality estimation
does not need to take any space in the base table

To be clear on that last point, the computed column is not required to be persisted in this case:

-- Note: not PERSISTED, metadata change only
ALTER TABLE #sargme
ADD DayDiff AS DATEDIFF(DAY, DateCol1, DateCol2);

-- Index the expression
CREATE NONCLUSTERED INDEX index_name
ON #sargme (DayDiff)
INCLUDE (DateCol1, DateCol2);

Now the query:

SELECT
    S.ID,
    S.DateCol1,
    S.DateCol2,
    DATEDIFF(DAY, S.DateCol1, S.DateCol2)
FROM
    #sargme AS S
WHERE
    DATEDIFF(DAY, S.DateCol1, S.DateCol2) >= 48;

...gives a trivial plan (execution plan image not shown).

As Martin Smith said, if you have connections using the wrong SET options, you could create a regular column and maintain the computed value using triggers.
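
A rough sketch of that trigger-based alternative follows. Everything named here (dbo.SargMe, trg_SargMe_DayDiff, ix_SargMe_DayDiff) is hypothetical; a permanent table is used because triggers cannot be created on temporary tables. It is one possible shape, not a tested production design.

```sql
-- Hypothetical permanent table standing in for #sargme
CREATE TABLE dbo.SargMe
(
    ID       BIGINT IDENTITY(1, 1) NOT NULL PRIMARY KEY CLUSTERED,
    DateCol1 DATE NOT NULL,
    DateCol2 DATE NOT NULL,
    DayDiff  INT  NULL  -- regular column, maintained by the trigger below
);
GO  -- batch separator for SSMS / sqlcmd

CREATE TRIGGER dbo.trg_SargMe_DayDiff
ON dbo.SargMe
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Skip statements that did not set either date column
    -- (this also stops the UPDATE below from re-triggering work)
    IF NOT (UPDATE(DateCol1) OR UPDATE(DateCol2)) RETURN;

    -- Recompute the difference only for the rows touched by the statement
    UPDATE S
    SET    DayDiff = DATEDIFF(DAY, S.DateCol1, S.DateCol2)
    FROM   dbo.SargMe AS S
    JOIN   inserted   AS I
           ON I.ID = S.ID;
END;
GO

CREATE NONCLUSTERED INDEX ix_SargMe_DayDiff
ON dbo.SargMe (DayDiff)
INCLUDE (DateCol1, DateCol2);
```

Unlike the computed column, a query has to reference the regular column explicitly (WHERE DayDiff >= 48) to use this index; a DATEDIFF expression in the WHERE clause would not be matched to it.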

All this only really matters (code challenge aside) if there's a real problem to solve, of course, as Aaron says in his answer.

This is fun to think about, but I don't know any way to achieve what you want reasonably given the constraints in the question. It seems like any optimal solution would require a new data structure of some type; the closest we have being the 'function index' approximation provided by an index on a non-persisted computed column as above.

Related Solutions

Sql-server – How to count rows with two properties in one index scan in SQL

Group by your condition

SELECT
    ItemCreationDate,
    COUNT(ItemId) AS ComputedCount,
    CASE WHEN ItemUnitsCount = 0 THEN 2 ELSE 1 END AS CountType
FROM
    Items
GROUP BY
    ItemCreationDate,
    CASE WHEN ItemUnitsCount = 0 THEN 2 ELSE 1 END;
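
If one row per date with both counts side by side is easier to consume, conditional aggregation is a common alternative. This is a sketch against the same Items table; the output column names ZeroUnitsCount and OtherCount are made up for illustration, and it assumes ItemId is never NULL, so that counting rows matches COUNT(ItemId).

```sql
SELECT
    ItemCreationDate,
    -- rows where ItemUnitsCount = 0 (CountType 2 in the query above)
    SUM(CASE WHEN ItemUnitsCount = 0 THEN 1 ELSE 0 END) AS ZeroUnitsCount,
    -- all other rows, including NULL ItemUnitsCount (CountType 1 above)
    SUM(CASE WHEN ItemUnitsCount = 0 THEN 0 ELSE 1 END) AS OtherCount
FROM
    Items
GROUP BY
    ItemCreationDate;
```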

Sql-server – How to avoid using variables in WHERE clause

Parameter sniffing is your friend almost all of the time, and you should write your queries so that it can be used. Parameter sniffing helps build the plan using the parameter values available when the query is compiled. The dark side of parameter sniffing is when the values used when compiling the query are not optimal for the queries to come.

The query in a stored procedure is compiled when the stored procedure is executed, not when the query itself is executed, so the values that SQL Server has to deal with here...

CREATE PROCEDURE WeeklyProc(@endDate DATE)
AS
BEGIN
  DECLARE @startDate DATE = DATEADD(DAY, -6, @endDate)
  SELECT
    -- Stuff
  FROM Sale
  WHERE SaleDate BETWEEN @startDate AND @endDate
END

are a known value for @endDate and an unknown value for @startDate. That leaves SQL Server guessing 30% of the rows for the filter on @startDate, combined with whatever the statistics tell it for @endDate. On a big table with a lot of rows, that could give you a scan where you would benefit most from a seek.

Your wrapper procedure solution makes sure that SQL Server sees the values when DateRangeProc is compiled so it can use known values for both @endDate and @startDate.
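
The wrapper approach could look roughly like the sketch below. The procedure names WeeklyProc and DateRangeProc come from the discussion above; the body of DateRangeProc and the SaleDate column list are placeholders, since the original procedure is not shown here.

```sql
-- Inner procedure: both dates arrive as parameters, so both can be sniffed
CREATE PROCEDURE DateRangeProc (@startDate DATE, @endDate DATE)
AS
BEGIN
  SELECT
    SaleDate  -- placeholder column list
  FROM Sale
  WHERE SaleDate BETWEEN @startDate AND @endDate;
END
GO  -- batch separator for SSMS / sqlcmd

-- Wrapper: computes the start date, then passes both values on
-- (this form replaces the single-procedure WeeklyProc shown earlier)
CREATE PROCEDURE WeeklyProc (@endDate DATE)
AS
BEGIN
  DECLARE @startDate DATE = DATEADD(DAY, -6, @endDate);

  EXEC DateRangeProc @startDate = @startDate, @endDate = @endDate;
END
```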

Both of your dynamic queries lead to the same thing: the values are known at compile time.

The one with a default null value is a bit special. The values known to SQL Server at compile time are a known value for @endDate and null for @startDate. Using a null in a BETWEEN gives you 0 rows, but SQL Server always guesses 1 row in those cases. That might be a good thing here, but if you call the stored procedure with a large date interval where a scan would have been the best choice, it may end up doing a bunch of seeks.

I left "Use the DATEADD() function directly" to the end of this answer because it is the one I would use and there is something strange with it as well.

First off, SQL Server does not call the function multiple times when it is used in the WHERE clause. DATEADD is considered a runtime constant.

And I would have thought that DATEADD is evaluated when the query is compiled, so that you would get a good estimate of the number of rows returned. But that is not so in this case. SQL Server estimates based on the value in the parameter regardless of what you do with DATEADD (tested on SQL Server 2012), so in your case the estimate will be the number of rows registered for @endDate. Why it does that I don't know, but it has to do with the DATE data type. Switch to DATETIME in the stored procedure and the table and the estimate will be accurate, meaning that DATEADD is considered at compile time for DATETIME but not for DATE.
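
For reference, the "use DATEADD directly" variant would look something like this sketch. The procedure name WeeklyProcDirect and the SaleDate column list are placeholders chosen for illustration.

```sql
CREATE PROCEDURE WeeklyProcDirect (@endDate DATE)
AS
BEGIN
  SELECT
    SaleDate  -- placeholder column list
  FROM Sale
  -- No local variable: the start of the range is computed from the parameter
  WHERE SaleDate BETWEEN DATEADD(DAY, -6, @endDate) AND @endDate;
END
```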

So, to summarize this rather lengthy answer, I would recommend the wrapper procedure solution. It always allows SQL Server to use the values provided when compiling the query, without the hassle of using dynamic SQL.

PS:

In comments you got two suggestions.

OPTION (OPTIMIZE FOR UNKNOWN) will give you an estimate of 9% of rows returned and OPTION (RECOMPILE) will make SQL Server see the parameter values since the query is recompiled every time.
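
As a sketch of where the RECOMPILE hint would go (the procedure name WeeklyProcRecompile and the column list are placeholders), the hint is attached to the statement itself:

```sql
CREATE PROCEDURE WeeklyProcRecompile (@endDate DATE)
AS
BEGIN
  DECLARE @startDate DATE = DATEADD(DAY, -6, @endDate);

  SELECT
    SaleDate  -- placeholder column list
  FROM Sale
  WHERE SaleDate BETWEEN @startDate AND @endDate
  OPTION (RECOMPILE);  -- statement is compiled on every execution
END
```

The trade-off is a compilation on every call, which may or may not matter depending on how often the procedure runs.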
