Sql-server – Best Way to Index Multiple Effective Date Columns

indexindex-tuningsql serversql-server-2016t-sql

I've got a table that is primarily to represent relationships between entities (i.e. mostly composed of foreign keys). These relationships change over time, and so the table has a StartDate and an EndDate column. I now need to add another dimension of start and end dates, meaning that the relationship can be viewed with two different date "lenses" (queried with two dates, @Date1 and @Date2), so the schema will look something like this:

MyJoinTable:

| Id | Entity1Id | Entity2Id | StartDate1 |  EndDate1  | StartDate2 |  EndDate2  |
|----|-----------|-----------|------------|------------|------------|------------|
|  1 | A         | B         | 1753-01-01 | 2018-09-01 | 1753-01-01 | 2025-01-01 |
|  2 | A         | B         | 2018-09-01 | 2018-10-01 | 1753-01-01 | 2018-11-01 |
|  3 | A         | C         | 2018-09-01 | 2018-10-01 | 2018-11-01 | 2025-01-01 |
|  4 | A         | B         | 2018-10-01 | 2025-01-01 | 1753-01-01 | 2018-11-01 |
|  5 | A         | D         | 2018-10-01 | 2025-10-01 | 2018-11-01 | 2025-01-01 |

The queries will mostly be joins to this table, for example:

SELECT e1.Field, e2.Field
FROM Entity1 e1
INNER JOIN MyJoinTable jt ON jt.Entity1Id = e1.Id
    AND StartDate1 <= @Date1 AND EndDate1 > @Date1
    AND StartDate2 <= @Date2 and EndDate2 > @Date2
INNER JOIN Entity2 e2 ON e2.Id = jt.Entity2Id

My questions are:

What's the best way to index this join table?
- Index on Entity1Id
- Index on Entity2Id
- Composite Index on all four date columns? (StartDate1, EndDate1, StartDate2, EndDate2)
What's the best way to constrain the DB so that I only get one relationship row returned for any @Date1, @Date2 combination?
Do you have any recommendations on a better data model?

Best Answer

Introduction

I do not have the time for extensive testing, but can suggest from where to start.

If you rewrite the query in a more symmetrical manner, to emphasise that both entities are joined to a two-dimensional cross-section of MyJoinTable:

SELECT E1.Field, E2.Field
FROM MyJoinTable JT
JOIN Entity1 E1 ON E1.Id = JT.Entity1Id
JOIN Entity2 E2 ON E2.Id = JT.Entity2Id
WHERE
    StartDate1 <= @date1 AND EndDate1 > @date1 AND
    StartDate2 <= @date2 AND EndDate2 > @date2

you will see that one reasonably efficient way of executing it is first to extract that cross-section and then to join the entities to it. The following equivalent query will serve for illustration purposes (but do not use it!):

-- A sample to demostrate the desired order or execution:
SELECT E1.Field, E2.Field
FROM
-- 1. Calculate the cross-section:
(   SELECT JT.Entity1Id, JT.Entity2Id
    FROM MyJoinTable JT
    WHERE 
        StartDate1 <= @date1 AND EndDate1 > @date1 AND
        StartDate2 <= @date2 AND EndDate2 > @date2
) SECT
-- 2. Join the entity tables:
JOIN Entity1 E1 ON E1.Id = SECT.Entity1Id
JOIN Entity2 E2 ON E2.Id = SECT.Entity2Id

It is the more important since you say that there is only

one relationship row returned for any @Date1, @Date2 combination.

For this reason, joining the entities in before MyJoinTable is reduced to the required cross-section by @date1 and @date2 is inefficient as it is likely to have many records for each entity so that the result will have too many rows. The entites are best joined at the end, using their natural key field, Id, which I assume is the clustered index. The solutions below, therefore, propose different ways of calculating this cross-section, corresponding to the SECT subquery of the example above—

Soution I: The easiest

Let us try to add useful indices to the original query. Since MSSQL's composite indices are hierarchical, they are useless in optimising interval comparisons, so the best we can do with the given structure is to index one of the date fields, yet make sure to cover all the other fields required form the table:

CREATE NONCLUSTERED INDEX SD1 ON MyJoinTable ( StartDate1 )
INCLUDE (Entity1Id, Entity2Id, StartDate2, EndDate1, EndDate2 )

The query then will be executed in the following order:

Using the index SD1, perform an index seek on MyJoinTable to find records by StartDate1, and filter them by StartDate2, EndDate1, and EndDate2 in the residual predicate
Join the entity tables using their natural key Id.

This method is not very efficient because only one of the four date predicates is fully optimised by an index, resudual predicates being boring data grinders.

Solution II: Set intersection

Another symmetrical way to obtain the two-dimensional JOIN cross-section goes through INTERSECTing the results of the four date predicates:

SELECT Entity1Id, Entity2Id
FROM
(   SELECT Id, Entity1Id, Entity2Id FROM MyJoinTable
    WHERE @date1 >= StartDate1
    INTERSECT
    SELECT Id, Entity1Id, Entity2Id FROM MyJoinTable
    WHERE @date2 >= StartDate2
    INTERSECT
    SELECT Id, Entity1Id, Entity2Id FROM MyJoinTable
    WHERE @date1 <  EndDate1
    INTERSECT
    SELECT Id, Entity1Id, Entity2Id FROM MyJoinTable
    WHERE @date2 <  EndDate2 )
SECT

With the suitable indices,

CREATE NONCLUSTERED INDEX SD1 ON MyJoinTable (StartDate1)
    INCLUDE (Id, Entity1Id, Entity2Id)
CREATE NONCLUSTERED INDEX SD2 ON MyJoinTable (StartDate2)
    INCLUDE (Id, Entity1Id, Entity2Id)
CREATE NONCLUSTERED INDEX ED1 ON MyJoinTable (EndDate1  )
    INCLUDE (Id, Entity1Id, Entity2Id)
CREATE NONCLUSTERED INDEX ED2 ON MyJoinTable (EndDate2  )
    INCLUDE (Id, Entity1Id, Entity2Id)

the plan for this query includes only clean index seek (wihout residual predicates) and hash match operations.

Observe that the entity keys in query and indices present a fourfold redundancy, which may be removed—at the expense of a more complicated execution plan—by joining the entities in separately:

SELECT JT.Entity1Id, JT.Entity2Id
FROM
(   SELECT Id FROM MyJoinTable WHERE @date1 >= StartDate1
    INTERSECT
    SELECT Id FROM MyJoinTable WHERE @date2 >= StartDate2
    INTERSECT
    SELECT Id FROM MyJoinTable WHERE @date1 <  EndDate1
    INTERSECT
    SELECT Id FROM MyJoinTable WHERE @date2 <  EndDate2 )
SECT
JOIN MyJoinTable JT ON JT.Id = SECT.Id

with these indices:

CREATE NONCLUSTERED INDEX SD1 ON MyJoinTable (StartDate1) INCLUDE (Id)
CREATE NONCLUSTERED INDEX SD2 ON MyJoinTable (StartDate2) INCLUDE (Id)
CREATE NONCLUSTERED INDEX ED1 ON MyJoinTable (EndDate1  ) INCLUDE (Id)
CREATE NONCLUSTERED INDEX ED2 ON MyJoinTable (EndDate2  ) INCLUDE (Id)

But since in either case each of four date predicates, bounding the date on one end only, does not reduce the amount of data sufficiently, hash matches may have to spill data into tempdb. If they do, this method is not fit for your environment.

Solution III: A compromise

Now we can merge solutions I & II in order to come up with a plan that does not require so much RAM and at the same time is reasonably fast:

SELECT Entity1Id, Entity2Id FROM
(   SELECT Id, Entity1Id, Entity2Id FROM MyJoinTable
    WHERE @date1 >= StartDate1 AND @date1 < EndDate1
    INTERSECT
    SELECT Id, Entity1Id, Entity2Id FROM MyJoinTable
    WHERE @date2 >= StartDate2 AND @date2 < EndDate2  )
SECT

Now the date constraints are more efficient in discaring data because they bound the date on both ends. With the indices below:

CREATE NONCLUSTERED INDEX SE1 ON MyJoinTable (StartDate1)
    INCLUDE (EndDate1, Entity1Id, Entity2Id)
CREATE NONCLUSTERED INDEX SE2 ON MyJoinTable (StartDate2)
    INCLUDE (EndDate2, Entity1Id, Entity2Id)

each constraint uses an index seek with a residual predicate, which is better than the single index seek in solution I and takes less RAM than solution II. The plan shows three opportunities of parallisation: one for each date constraint, and one for the INTERSECT operation.

A non-redundant version of this approach would be:

SELECT JT.Entity1Id, JT.Entity2Id FROM
(   SELECT Id FROM MyJoinTable
    WHERE @date1 >= StartDate1 AND @date1 < EndDate1
    INTERSECT
    SELECT Id FROM MyJoinTable
    WHERE @date2 >= StartDate2 AND @date2 < EndDate2  )
SECT
JOIN MyJoinTable JT ON JT.Id = SECT.Id

with indices

CREATE NONCLUSTERED INDEX SE1 ON MyJoinTable (StartDate1)
    INCLUDE (EndDate1, Id)
CREATE NONCLUSTERED INDEX SE2 ON MyJoinTable (StartDate2)
    INCLUDE (EndDate2, Id)

Although it should work better with your data, where the SECT subquery returns at most one row, in my crude tests with randomly generated data it has been less efficient because the server used a hash match instead of a nested loop join with that single row. Do try it on your side.

Solution IV: Optimised structure

It is possible to optimise your query at the expense of introducing a more complicated structure that requires additional maintenance and slows down the modification of data in MyJoinTable. If you are willing pay the price, store the date ranges as sets of days:

CREATE TABLE MyJoinTable
(   Id         INT IDENTITY(1,1),
    Entity1Id  INT,
    Entity2Id  INT,
    Range1     INT, -- reference to Ranges.Id 
    Range2     INT  -- reference to Ranges.Id
)

CREATE TABLE Ranges
(   Id    INT,
    Date  Date
)

and query the relation thus:

SELECT E1.Field, E2.Field
FROM MyJoinTable JT
JOIN Entity1 E1  ON E1.Id = JT.Entity1Id
JOIN Entity2 E2  ON E2.Id = JT.Entity2Id
JOIN Ranges  R1  ON R1.Id = JT.Range1
JOIN Ranges  R2  ON R2.Id = JT.Range2
WHERE
    R1.Date = @date1 AND
    R2.Date = @date2

You will need some testing in order to determine optimal indices, but I think the following should work:

CREATE NONCLUSTERED INDEX RD ON Ranges      ( Date   ) INCLUDE ( id )
CREATE NONCLUSTERED INDEX RR ON MyJoinTable ( Range1, Range2 )
-- optionally: INCLUDE (Entity1Id, Entity2Id)

Index RD will make sure the ranges are quickly found corresponding to the specified dates, and index RR will help to find the record matching these ranges. But then you shall have to devise a means of filling the Ranges table and keeping it in sync with MyJoinTable, because doing so by hand is out of the question.

Related Solutions

Using indexes to create multiple relationships to a table in order to enforce data integrity and add meaning

Situation 1:

Your tables have one relationship and not two. (example: a Device belongs to a DeviceType)

So, keep only one relationship, the one with the composite keys (that include the Primary Key). The other relationship is redundant when the composite one is defined.

I would also suggest you have same names for related columns:

DeviceCategory Table
CategoryCode | Name        | Description
----------------------------------------
WKS          | Workstation | Description of what classifies an item as a workstation...
LPT          | Laptop      | Description of what classifies an item as a laptop...

DeviceType Table
DeviceTypeID | CategoryCode | Manufacturer | Model  | IsTrackedInOtherSystemDefault
-----------------------------------------------------------------------------------
1            | WKS          | Dell         | GX1000 | true
2            | LPT          | HP           | dv4000 | false
3            | WKS          | HP           | xx9000 | false

Device Table
DeviceID | SerialNumber | DeviceTypeID | IsTrackedInOtherSystem | CategoryCode 
------------------------------------------------------------------------------
1        | I81U812      | 1            | true                   | WKS
2        | N0S4A2       | 1            | false                  | WKS
3        | 3BL1NDMIC3   | 2            | false                  | LPT

So, the design would be:

DeviceCategory
--------------
CategoryCode   PK
Name           U1
Description

DeviceType 
----------
DeviceTypeID   PK   U1
CategoryCode   FK   U1
Manufacturer   U2
Model          U2
IsTrackedInOtherSystemDefault

Device 
------
DeviceID       PK   U1
SerialNumber   U2
DeviceTypeID   FK1
IsTrackedInOtherSystem
CategoryCode   FK1  U1

and for the Computer:

Computer
--------
DeviceID       PK   FK1
Hostname       U1
IPAddress      U2
CategoryCode   FK1  CHK

The "additional" UNIQUE keys (the two composite U1 ones) will be needed in most DBMS to enforce the foreign key constraints. I guess this answers your question 2, relationships needs indices to be enforced, so (you have to) use them. They will be used by the DBMS not only to enforce integrity but in your queries/statements, when you will be joining the tables.

The only one that is not needed is the U3 you had in the Computer table.

About question 3 (the over-engineering part): No, I don't think so but that's just my opinion. And you haven't told us if this is a homework/exercise or a real project, whether you will be holding only your family's or a multi-million company's inventory, etc.

Situation 2

I think what you have is fine and there is no need (and not a good idea) to have referential integrity constraints on these columns. This is a default value that is copied in the second table via a stored procedure (I guess during Inserts on the second table?) or altered by a user. If you add an FK, won't that deny users the ability to override the default?

The names of the two columns are self-explanatory enough for a DBA to understand the functionality.

T-sql – Return table based on 1, 2, 3, 4, 5 or 0 selected search options

Dynamic SQL is another option. I would try both this method and @a1ex07's method and see which runs faster for you.

CREATE PROCEDURE WorkAssignmentReport @Category varchar(25)
                                 ,@---Name varchar(25)
                                 ,@Stats varchar(25) 
                                 ,@---Name  varchar(25) 
                                 .........
                                 ......

DECLARE @sql nvarchar(max)

SET @sql = 'SELECT * ' + CHAR(13) + 
        'FROM [Ticket]  ' + CHAR(13) + 
        'INNER JOIN dbo.[TicketType] ON [Ticket].TicketTypeID = [TicketType].TicketTypeID  ' + CHAR(13) + 
        'INNER JOIN dbo.[Status] ON [Ticket].Stat .... ' + CHAR(13) + 
        '..... ' + CHAR(13) + 
        ' ' + CHAR(13) + 
        ' ' + CHAR(13) + 
        'WHERE 1=1 '

IF @Category IS NOT NULL
    SET @sql = @sql + CHAR(13) + '  AND [TicketType].Name LIKE @Category '

IF @ReportName IS NOT NULL
    SET @sql = @sql + CHAR(13) + '  AND [Ticket].Name LIKE @ReportName  '

IF @Status IS NOT NULL
    SET @sql = @sql + CHAR(13) + '  AND [Status].StatusName LIKE @Status '

IF @PhaseName IS NOT NULL
    SET @sql = @sql + CHAR(13) + '  AND [Phase].PhaseName LIKE @PhaseName  '

IF @Name1 IS NOT NULL
    SET @sql = @sql + CHAR(13) + '  AND (BA.FullName = @Name1 OR AD.FullName = @Name1) '

EXEC sp_executesql @sql, N'@Category varchar(25), @ReportName varchar(25), @Status varchar(25), @PhaseName varchar(25), @Name1 varchar(25)',
                @Category, @ReportName, @Status, @PhaseName, @Name1