I think what you need here is a non-equi join:
select *
from
    table_a,
    table_b
where
    table_a.id1 = table_b.id1
    and table_a.id2 = table_b.id2
    and table_a.evnt_sec BETWEEN table_b.evnt_sec - 2 and table_b.evnt_sec + 2
I am going to assume that there isn't an index on the date columns; otherwise I think the query would have been structured differently. If there is one, you can probably find a better-performing query than this.

The advantage of this query is that it can get all the data in one scan. The disadvantage is that it has to sort the data and join EventEmployee on the entire table, so as always, test with your own situation. This query also assumes that the MAX date is either unique or that equivalent rows would be acceptable.
USE AdventureWorks2012
GO

;WITH Base AS (
    SELECT
        TransactionHistory.*
        ,ProductVendor.BusinessEntityID
        ,MAX(CASE WHEN TransactionDate < '2008-08-01' THEN TransactionDate END)
            OVER (PARTITION BY ProductVendor.BusinessEntityID) AS PreviousVendorTransaction
        ,COUNT(CASE WHEN TransactionDate >= '2008-08-01' THEN 1 END)
            OVER (PARTITION BY ProductVendor.BusinessEntityID) AS VendorAfterCutoff
    FROM
        Production.TransactionHistory
        -- Doesn't make the most sense, but I need a repeating relation
        INNER JOIN Purchasing.ProductVendor
            ON TransactionHistory.ProductID = ProductVendor.ProductID
),
Filtered AS (
    SELECT
        *
    FROM
        Base
    WHERE
        Base.TransactionDate >= '2008-08-01'
        OR (TransactionDate = PreviousVendorTransaction AND VendorAfterCutoff > 0)
)
SELECT DISTINCT
    TransactionID
    ,ProductID
    ,ReferenceOrderID
    ,ReferenceOrderLineID
    ,TransactionDate
    ,TransactionType
    ,Quantity
    ,ActualCost
    ,ModifiedDate
FROM
    Filtered
Edit:
Hmm, I think I may have to take back my comment about structuring it differently if there are indexes. The other suggestions I have are probably fairly minor:
- Make sure the query is using the indexes you're expecting it to: start and end dates to build the temp table, the end date to drive the previous-event loop.
- If the query to build the temp table is doing a lookup on the clustered index, it may be better to hold off and do that as part of the main query.
- Try using a CTE instead of a temp table. I think a CTE might be more competitive with the way the query below is structured.
- If you are returning a lot of events, it might be better to pull out the event table lookup to the main query to give the optimizer the option of doing a merge join.
- I don't see a way of optimizing the previous-event lookup short of an indexed view; a rough sketch of what that could look like follows this list.
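Purely as an illustration of that last point, here is a minimal sketch of such an indexed view. The object and column names ([Events], [EventEmployee], EmployeeID, EventEnd) are taken from the queries in this answer, and it assumes (EmployeeID, EventID) is unique in [EventEmployee]; treat it as a starting point rather than a drop-in solution.

-- Hypothetical indexed view to support the "latest EventEnd before @StartDate
-- per employee" lookup; names are assumptions based on the queries below.
CREATE VIEW dbo.EmployeeEventEnd
WITH SCHEMABINDING
AS
SELECT
    ee.[EmployeeID],
    e.[EventID],
    e.[EventEnd]
FROM dbo.[EventEmployee] AS ee
INNER JOIN dbo.[Events] AS e
    ON e.[EventID] = ee.[EventID];
GO

-- Materialize the view so the previous-event CROSS APPLY can seek on
-- (EmployeeID, EventEnd) instead of joining the base tables each time.
CREATE UNIQUE CLUSTERED INDEX CIX_EmployeeEventEnd
    ON dbo.EmployeeEventEnd ([EmployeeID], [EventEnd], [EventID]);

On Standard Edition the lookup query would also need a WITH (NOEXPAND) hint to read from the view directly.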
Here's a query that combines a few of those ideas.
SELECT
    e.[EventID]
INTO #EventTemp
FROM
    [Events] AS e
WHERE
    ( e.[EventStart] >= @StartDate AND e.[EventStart] <= @EndDate )
    OR ( e.[EventEnd] >= @StartDate AND e.[EventEnd] <= @EndDate )
;
WITH PrevEvent AS (
    SELECT
        EmpPrevEvent.[EventID]
    FROM
        (
            SELECT DISTINCT
                ee.[EmployeeID]
            FROM
                #EventTemp
                INNER JOIN [EventEmployee] AS ee ON
                    #EventTemp.[EventID] = ee.[EventID]
        ) AS Emp
        CROSS APPLY (
            SELECT TOP 1
                e.[EventID]
            FROM
                [Events] AS e
                INNER JOIN [EventEmployee] AS ee ON
                    e.[EventID] = ee.[EventID]
            WHERE
                ee.[EmployeeID] = Emp.[EmployeeID]
                AND e.[EventEnd] < @StartDate
            ORDER BY
                e.[EventEnd] DESC
        ) AS EmpPrevEvent
)
SELECT
    e.[EventID],
    e.[EventStart],
    e.[EventEnd],
    e.[EventTypeID]
FROM
    [Events] AS e
WHERE
    e.EventID IN (
        SELECT EventID FROM #EventTemp
        UNION
        SELECT EventID FROM PrevEvent
    )
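Note that the script above assumes @StartDate and @EndDate are already in scope (for example as stored procedure parameters) and that #EventTemp does not already exist. Run on its own, it might be wrapped roughly like this; the declarations and date values are placeholders, not part of the original answer.

-- Hypothetical harness; the dates are placeholders.
DECLARE @StartDate datetime = '2018-06-01';
DECLARE @EndDate   datetime = '2018-06-30';

IF OBJECT_ID('tempdb..#EventTemp') IS NOT NULL
    DROP TABLE #EventTemp;

-- ...the SELECT ... INTO #EventTemp statement and the PrevEvent query from above go here...

DROP TABLE #EventTemp;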
Best Answer
This is a hard problem to solve in general, but there are a couple of things we can do to help the optimizer choose a plan. This script creates a table with 10,000 rows with a known pseudo-random distribution of rows to illustrate:
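The original demo script isn't reproduced here. As a rough stand-in, a test table of the shape the rest of the discussion assumes might look something like the following; the table name, the population method, and the 30-day interval length are my assumptions, so it will not reproduce the exact row counts (2076, 468, 2110) quoted below.

-- Hypothetical stand-in for the original demo script; names and the data
-- distribution are assumptions, not the author's actual script.
CREATE TABLE dbo.Intervals
(
    IntervalID integer IDENTITY NOT NULL PRIMARY KEY,
    StartDate  datetime NOT NULL,
    EndDate    datetime NOT NULL
);

-- 10,000 pseudo-random intervals during 2010, each up to 30 days long
INSERT dbo.Intervals (StartDate, EndDate)
SELECT
    s.StartDate,
    DATEADD(DAY, ABS(CHECKSUM(NEWID())) % 30, s.StartDate)
FROM
(
    SELECT TOP (10000)
        DATEADD(DAY, ABS(CHECKSUM(NEWID())) % 365, '20100101') AS StartDate
    FROM sys.all_columns AS c1
    CROSS JOIN sys.all_columns AS c2
) AS s;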
The first question is how to index this table. One option is to provide two indexes on the DATETIME columns, so the optimizer can at least choose whether to seek on StartDate or EndDate. Naturally, the inequalities on both StartDate and EndDate mean that only one column in each index can support a seek in the example query, but this is about the best we can do. We might consider making the second column in each index an INCLUDE rather than a key, but we might have other queries that can perform an equality seek on the leading column and an inequality seek on the second column. Also, we may get better statistics this way. Anyway...

This query uses variables, so in general the optimizer will guess at selectivity and distribution, resulting in a guessed cardinality estimate of 81 rows. In fact, the query produces 2076 rows, a discrepancy that might be important in a more complex example.
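To make the rest of the discussion concrete, here is a sketch of the two-index option and the kind of variable-driven range query being described, using the stand-in table above. The index names and the query itself are my assumptions (the original query isn't shown), so the 81-row estimate and 2076-row actual figures come from the author's data, not from this sketch.

-- One index leading on each datetime column, carrying the other column,
-- so the optimizer can choose which range to seek on.
CREATE INDEX IX_Intervals_StartDate ON dbo.Intervals (StartDate, EndDate);
CREATE INDEX IX_Intervals_EndDate   ON dbo.Intervals (EndDate, StartDate);

-- Inequalities on both columns, driven by variables, so only one column
-- per index can support a seek.
DECLARE @RangeStart datetime = '20100601',
        @RangeEnd   datetime = '20100630';

SELECT I.IntervalID, I.StartDate, I.EndDate
FROM dbo.Intervals AS I
WHERE I.StartDate <= @RangeEnd
  AND I.EndDate >= @RangeStart;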
On SQL Server 2008 SP1 CU5 or later (or R2 RTM CU1) we can take advantage of the Parameter Embedding Optimization to get better estimates, simply by adding OPTION (RECOMPILE) to the SELECT query above. This causes a compilation just before the batch executes, allowing SQL Server to 'see' the real parameter values and optimize for those. With this change, the estimate improves to 468 rows (though you do need to check the runtime plan to see this). This estimate is better than 81 rows, but still not all that close. The modelling extensions enabled by trace flag 2301 may help in some cases, but not with this query.

The problem is where the rows qualified by the two range searches overlap. One of the simplifying assumptions made in the optimizer's costing and cardinality estimation component is that predicates are independent (so if both have a selectivity of 50%, the result of applying both is assumed to qualify 50% of 50% = 25% of the rows). Where this sort of correlation is a problem, we can often work around it with multi-column and/or filtered statistics. With two ranges with unknown start and end points, this becomes impractical. This is where we sometimes have to resort to rewriting the query to a form that happens to produce a better estimate:
This form happens to produce a runtime estimate of 2110 rows (versus 2076 actual). Unless you have TF 2301 on, in which case the more advanced modelling techniques see through the trick and produce exactly the same estimate as before: 468 rows.
One day SQL Server might gain native support for intervals. If that comes with good statistical support, developers might dread tuning query plans like this a little less.