Sql-server – Best Way to Index Multiple Effective Date Columns

indexindex-tuningsql serversql-server-2016t-sql

I've got a table that is primarily to represent relationships between entities (i.e. mostly composed of foreign keys). These relationships change over time, and so the table has a StartDate and an EndDate column. I now need to add another dimension of start and end dates, meaning that the relationship can be viewed with two different date "lenses" (queried with two dates, @Date1 and @Date2), so the schema will look something like this:

MyJoinTable:

| Id | Entity1Id | Entity2Id | StartDate1 |  EndDate1  | StartDate2 |  EndDate2  |
|----|-----------|-----------|------------|------------|------------|------------|
|  1 | A         | B         | 1753-01-01 | 2018-09-01 | 1753-01-01 | 2025-01-01 |
|  2 | A         | B         | 2018-09-01 | 2018-10-01 | 1753-01-01 | 2018-11-01 |
|  3 | A         | C         | 2018-09-01 | 2018-10-01 | 2018-11-01 | 2025-01-01 |
|  4 | A         | B         | 2018-10-01 | 2025-01-01 | 1753-01-01 | 2018-11-01 |
|  5 | A         | D         | 2018-10-01 | 2025-10-01 | 2018-11-01 | 2025-01-01 |

The queries will mostly be joins to this table, for example:

SELECT e1.Field, e2.Field
FROM Entity1 e1
INNER JOIN MyJoinTable jt ON jt.Entity1Id = e1.Id
    AND StartDate1 <= @Date1 AND EndDate1 > @Date1
    AND StartDate2 <= @Date2 and EndDate2 > @Date2
INNER JOIN Entity2 e2 ON e2.Id = jt.Entity2Id

My questions are:

  1. What's the best way to index this join table?
    • Index on Entity1Id
    • Index on Entity2Id
    • Composite Index on all four date columns? (StartDate1, EndDate1, StartDate2, EndDate2)
  2. What's the best way to constrain the DB so that I only get one relationship row returned for any @Date1, @Date2 combination?
  3. Do you have any recommendations on a better data model?

Best Answer

Introduction

I do not have the time for extensive testing, but can suggest from where to start.

If you rewrite the query in a more symmetrical manner, to emphasise that both entities are joined to a two-dimensional cross-section of MyJoinTable:

SELECT E1.Field, E2.Field
FROM MyJoinTable JT
JOIN Entity1 E1 ON E1.Id = JT.Entity1Id
JOIN Entity2 E2 ON E2.Id = JT.Entity2Id
WHERE
    StartDate1 <= @date1 AND EndDate1 > @date1 AND
    StartDate2 <= @date2 AND EndDate2 > @date2

you will see that one reasonably efficient way of executing it is first to extract that cross-section and then to join the entities to it. The following equivalent query will serve for illustration purposes (but do not use it!):

-- A sample to demostrate the desired order or execution:
SELECT E1.Field, E2.Field
FROM
-- 1. Calculate the cross-section:
(   SELECT JT.Entity1Id, JT.Entity2Id
    FROM MyJoinTable JT
    WHERE 
        StartDate1 <= @date1 AND EndDate1 > @date1 AND
        StartDate2 <= @date2 AND EndDate2 > @date2
) SECT
-- 2. Join the entity tables:
JOIN Entity1 E1 ON E1.Id = SECT.Entity1Id
JOIN Entity2 E2 ON E2.Id = SECT.Entity2Id

It is the more important since you say that there is only

one relationship row returned for any @Date1, @Date2 combination.

For this reason, joining the entities in before MyJoinTable is reduced to the required cross-section by @date1 and @date2 is inefficient as it is likely to have many records for each entity so that the result will have too many rows. The entites are best joined at the end, using their natural key field, Id, which I assume is the clustered index. The solutions below, therefore, propose different ways of calculating this cross-section, corresponding to the SECT subquery of the example above—

Soution I: The easiest

Let us try to add useful indices to the original query. Since MSSQL's composite indices are hierarchical, they are useless in optimising interval comparisons, so the best we can do with the given structure is to index one of the date fields, yet make sure to cover all the other fields required form the table:

CREATE NONCLUSTERED INDEX SD1 ON MyJoinTable ( StartDate1 )
INCLUDE (Entity1Id, Entity2Id, StartDate2, EndDate1, EndDate2 )

The query then will be executed in the following order:

  1. Using the index SD1, perform an index seek on MyJoinTable to find records by StartDate1, and filter them by StartDate2, EndDate1, and EndDate2 in the residual predicate

  2. Join the entity tables using their natural key Id.

This method is not very efficient because only one of the four date predicates is fully optimised by an index, resudual predicates being boring data grinders.

Solution II: Set intersection

Another symmetrical way to obtain the two-dimensional JOIN cross-section goes through INTERSECTing the results of the four date predicates:

SELECT Entity1Id, Entity2Id
FROM
(   SELECT Id, Entity1Id, Entity2Id FROM MyJoinTable
    WHERE @date1 >= StartDate1
    INTERSECT
    SELECT Id, Entity1Id, Entity2Id FROM MyJoinTable
    WHERE @date2 >= StartDate2
    INTERSECT
    SELECT Id, Entity1Id, Entity2Id FROM MyJoinTable
    WHERE @date1 <  EndDate1
    INTERSECT
    SELECT Id, Entity1Id, Entity2Id FROM MyJoinTable
    WHERE @date2 <  EndDate2 )
SECT

With the suitable indices,

CREATE NONCLUSTERED INDEX SD1 ON MyJoinTable (StartDate1)
    INCLUDE (Id, Entity1Id, Entity2Id)
CREATE NONCLUSTERED INDEX SD2 ON MyJoinTable (StartDate2)
    INCLUDE (Id, Entity1Id, Entity2Id)
CREATE NONCLUSTERED INDEX ED1 ON MyJoinTable (EndDate1  )
    INCLUDE (Id, Entity1Id, Entity2Id)
CREATE NONCLUSTERED INDEX ED2 ON MyJoinTable (EndDate2  )
    INCLUDE (Id, Entity1Id, Entity2Id)

the plan for this query includes only clean index seek (wihout residual predicates) and hash match operations.

Observe that the entity keys in query and indices present a fourfold redundancy, which may be removed—at the expense of a more complicated execution plan—by joining the entities in separately:

SELECT JT.Entity1Id, JT.Entity2Id
FROM
(   SELECT Id FROM MyJoinTable WHERE @date1 >= StartDate1
    INTERSECT
    SELECT Id FROM MyJoinTable WHERE @date2 >= StartDate2
    INTERSECT
    SELECT Id FROM MyJoinTable WHERE @date1 <  EndDate1
    INTERSECT
    SELECT Id FROM MyJoinTable WHERE @date2 <  EndDate2 )
SECT
JOIN MyJoinTable JT ON JT.Id = SECT.Id

with these indices:

CREATE NONCLUSTERED INDEX SD1 ON MyJoinTable (StartDate1) INCLUDE (Id)
CREATE NONCLUSTERED INDEX SD2 ON MyJoinTable (StartDate2) INCLUDE (Id)
CREATE NONCLUSTERED INDEX ED1 ON MyJoinTable (EndDate1  ) INCLUDE (Id)
CREATE NONCLUSTERED INDEX ED2 ON MyJoinTable (EndDate2  ) INCLUDE (Id)

But since in either case each of four date predicates, bounding the date on one end only, does not reduce the amount of data sufficiently, hash matches may have to spill data into tempdb. If they do, this method is not fit for your environment.

Solution III: A compromise

Now we can merge solutions I & II in order to come up with a plan that does not require so much RAM and at the same time is reasonably fast:

SELECT Entity1Id, Entity2Id FROM
(   SELECT Id, Entity1Id, Entity2Id FROM MyJoinTable
    WHERE @date1 >= StartDate1 AND @date1 < EndDate1
    INTERSECT
    SELECT Id, Entity1Id, Entity2Id FROM MyJoinTable
    WHERE @date2 >= StartDate2 AND @date2 < EndDate2  )
SECT

Now the date constraints are more efficient in discaring data because they bound the date on both ends. With the indices below:

CREATE NONCLUSTERED INDEX SE1 ON MyJoinTable (StartDate1)
    INCLUDE (EndDate1, Entity1Id, Entity2Id)
CREATE NONCLUSTERED INDEX SE2 ON MyJoinTable (StartDate2)
    INCLUDE (EndDate2, Entity1Id, Entity2Id)

each constraint uses an index seek with a residual predicate, which is better than the single index seek in solution I and takes less RAM than solution II. The plan shows three opportunities of parallisation: one for each date constraint, and one for the INTERSECT operation.

A non-redundant version of this approach would be:

SELECT JT.Entity1Id, JT.Entity2Id FROM
(   SELECT Id FROM MyJoinTable
    WHERE @date1 >= StartDate1 AND @date1 < EndDate1
    INTERSECT
    SELECT Id FROM MyJoinTable
    WHERE @date2 >= StartDate2 AND @date2 < EndDate2  )
SECT
JOIN MyJoinTable JT ON JT.Id = SECT.Id

with indices

CREATE NONCLUSTERED INDEX SE1 ON MyJoinTable (StartDate1)
    INCLUDE (EndDate1, Id)
CREATE NONCLUSTERED INDEX SE2 ON MyJoinTable (StartDate2)
    INCLUDE (EndDate2, Id)

Although it should work better with your data, where the SECT subquery returns at most one row, in my crude tests with randomly generated data it has been less efficient because the server used a hash match instead of a nested loop join with that single row. Do try it on your side.

Solution IV: Optimised structure

It is possible to optimise your query at the expense of introducing a more complicated structure that requires additional maintenance and slows down the modification of data in MyJoinTable. If you are willing pay the price, store the date ranges as sets of days:

CREATE TABLE MyJoinTable
(   Id         INT IDENTITY(1,1),
    Entity1Id  INT,
    Entity2Id  INT,
    Range1     INT, -- reference to Ranges.Id 
    Range2     INT  -- reference to Ranges.Id
)

CREATE TABLE Ranges
(   Id    INT,
    Date  Date
)

and query the relation thus:

SELECT E1.Field, E2.Field
FROM MyJoinTable JT
JOIN Entity1 E1  ON E1.Id = JT.Entity1Id
JOIN Entity2 E2  ON E2.Id = JT.Entity2Id
JOIN Ranges  R1  ON R1.Id = JT.Range1
JOIN Ranges  R2  ON R2.Id = JT.Range2
WHERE
    R1.Date = @date1 AND
    R2.Date = @date2

You will need some testing in order to determine optimal indices, but I think the following should work:

CREATE NONCLUSTERED INDEX RD ON Ranges      ( Date   ) INCLUDE ( id )
CREATE NONCLUSTERED INDEX RR ON MyJoinTable ( Range1, Range2 )
-- optionally: INCLUDE (Entity1Id, Entity2Id)

Index RD will make sure the ranges are quickly found corresponding to the specified dates, and index RR will help to find the record matching these ranges. But then you shall have to devise a means of filling the Ranges table and keeping it in sync with MyJoinTable, because doing so by hand is out of the question.