SQL Server – Best way to index a table with two datetime columns that are equally queried

database-design, sql-server-2008-r2

SQL Server 2008 R2.

I have a table with ~70 million records and about 10 inserts per second.

It's currently clustered on a CreatedAt datetime column, which always increases. 50% of queries involve this column.

There is another datetime column, IssuedAt, which has a different meaning but is generally within a day or so of CreatedAt. The other 50% of queries involve this column, which has a nonclustered (NC) index on it.

There are a number of other FK columns – the row is probably about 150 bytes wide, if that's relevant – and a few more indexes; the table is heavily queried for a variety of reports in different ways.

My question is: what is the best way to index this table with regard to the clustered index and the two datetime columns?

a) I'm concerned the clustered key is bigger than it has to be (datetime + implicit uniquifier), leading to bigger NC indexes. Should I add an INT identity and cluster on that instead?
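If you go that route, a minimal sketch might look like the following. All table and index names here are placeholders (the question doesn't give the real ones), and dropping/recreating a clustered index rebuilds every NC index, which is expensive on a 70m-row table:

```sql
-- Sketch only: dbo.MyTable and all index names are placeholders.
-- A narrow, ever-increasing surrogate key as the clustering key.
ALTER TABLE dbo.MyTable ADD Id int IDENTITY(1,1) NOT NULL;

-- Recreating the clustered index rebuilds all NC indexes too,
-- so schedule this carefully on a large table.
DROP INDEX IX_MyTable_CreatedAt_CI ON dbo.MyTable;
CREATE UNIQUE CLUSTERED INDEX IX_MyTable_Id ON dbo.MyTable (Id);

-- CreatedAt then needs its own NC index to keep serving its 50% of queries.
CREATE NONCLUSTERED INDEX IX_MyTable_CreatedAt ON dbo.MyTable (CreatedAt);
```

The trade-off: every NC index row then carries a 4-byte int instead of an 8-byte datetime plus 4-byte uniquifier, but queries on CreatedAt start paying bookmark lookups too unless that index covers them.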

b) My queries against the IssuedAt column can be expensive due to bookmark lookups. I'm faced with INCLUDE-ing more and more columns in that index (hurting write performance). Is there an alternative tactic here?
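For reference, the covering-index tactic being weighed here looks like this; the table and column names are placeholders, since the real ones aren't given:

```sql
-- Placeholder names. INCLUDE-d columns live only in the leaf level,
-- so they don't widen the key or change its sort order, but every
-- insert/update of those columns must still maintain this index.
CREATE NONCLUSTERED INDEX IX_MyTable_IssuedAt_Covering
ON dbo.MyTable (IssuedAt)
INCLUDE (SomeFkId, SomeAmount);
```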

Thanks in advance.

UPDATE:

Just for clarity – I'm aware of the need for benchmarking, and I can see that some of the queries are not performing satisfactorily.

There is an inherent tension between 50% of the queries relying on datetime column A with a CI, and 50% on datetime column B with an NCI. I was hoping there might be some sort of approach/trick worth considering to offset this tension – for example a new filtered index, or moving to a clustered index on the date component of both columns when they match, or some other technique along those lines that others use to relieve this tension.
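One concrete variant of the filtered-index idea, if the expensive reports mostly hit a recent window (an assumption – the question doesn't say), is to index only those rows. Note that in SQL Server 2008 R2 a filtered index predicate must compare a column to constants, so you cannot filter on a comparison between IssuedAt and CreatedAt:

```sql
-- Hypothetical: index only the recent rows the reports actually touch.
-- The WHERE clause must be column-vs-constant (no column-to-column
-- predicates), and the constant needs periodic maintenance as time moves on.
CREATE NONCLUSTERED INDEX IX_MyTable_IssuedAt_Recent
ON dbo.MyTable (IssuedAt)
WHERE IssuedAt >= '20130101';
```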

Second Update:

I'm now considering the following:

  1. Create a new column, IssuedAtOffset (int): the difference between CreatedAt and IssuedAt, in seconds. I know from business realities that an int is sufficient to capture this delta, and IssuedAt's milliseconds are irrelevant, so that will work.

  2. Update IssuedAtOffset to the correct values.

  3. Drop IssuedAt.

  4. Rebuild the CI as: CreatedAt + IssuedAtOffset.

  5. Create a view where IssuedAt is returned as a calculated column based on CreatedAt + IssuedAtOffset.

Thus any query against either CreatedAt or IssuedAt can be served directly from the CI when I go through that view. It will mean more CPU, since the addition has to happen for every row, but I'm guessing that's trivial vs the IO savings (still to be benchmarked, of course).
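Steps 1–5 could be sketched as follows. The table and index names are placeholders, and on a 70m-row table step 2's UPDATE would need to be batched:

```sql
-- 1. Add the offset column (seconds between CreatedAt and IssuedAt).
ALTER TABLE dbo.MyTable ADD IssuedAtOffset int NULL;
GO

-- 2. Backfill (batch this in production; shown as one statement for brevity).
UPDATE dbo.MyTable
SET IssuedAtOffset = DATEDIFF(second, CreatedAt, IssuedAt);
GO

-- 3 & 4. Drop IssuedAt and its NCI, then re-cluster on both columns.
-- Dropping and recreating the CI means two full rebuilds of the table.
DROP INDEX IX_MyTable_IssuedAt ON dbo.MyTable;
ALTER TABLE dbo.MyTable DROP COLUMN IssuedAt;
DROP INDEX IX_MyTable_CreatedAt_CI ON dbo.MyTable;
CREATE CLUSTERED INDEX IX_MyTable_CI
    ON dbo.MyTable (CreatedAt, IssuedAtOffset);
GO

-- 5. Expose IssuedAt as a calculated column via a view.
CREATE VIEW dbo.MyTableV
AS
SELECT CreatedAt,
       DATEADD(second, IssuedAtOffset, CreatedAt) AS IssuedAt
       -- , plus the other report columns
FROM dbo.MyTable;
GO
```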

Have I missed any downsides/problems here?

Third update:

I created two tables to test this approach above, and the results are not quite what I expected.

T1 represents the current setup:

CREATE TABLE [dbo].[t1](
    [CreatedAt] [datetime] NULL,
    [IssuedAt] [datetime] NULL,
    [Col1] [varchar](20) NULL,
    [Col2] [char](4000) NULL
) ON [PRIMARY]

CI on one column:

CREATE CLUSTERED INDEX [IX_Clustered] ON [dbo].[t1]
(
    [CreatedAt] ASC
)

NCI on the second datetime column:

CREATE NONCLUSTERED INDEX [IX_t1_NC] ON [dbo].[t1]
(
    [IssuedAt] ASC
)

T2 represents the proposed setup:

CREATE TABLE [dbo].[t2](
    [CreatedAt] [datetime] NULL,
    [IssuedAt] [datetime] NULL,  -- this would be dropped in due course
    [Col1] [varchar](20) NULL,
    [Col2] [char](4000) NULL,
    [IssuedAtOffset] [int] NOT NULL
) ON [PRIMARY]

A single CI on both columns:

CREATE CLUSTERED INDEX [IX_t2_CI] ON [dbo].[t2]
(
    [CreatedAt] ASC,
    [IssuedAtOffset] ASC
)

I inserted 100k records into t1, copied them to t2, and set IssuedAtOffset to the correct value in t2.
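For anyone reproducing this, the test rows can be generated with something like the following (my own sketch, not the exact script used; the start date and the fixed one-hour offset are arbitrary):

```sql
-- Generate 100k rows for t1 with IssuedAt one hour after CreatedAt.
;WITH n AS (
    SELECT TOP (100000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO dbo.t1 (CreatedAt, IssuedAt, Col1, Col2)
SELECT DATEADD(second, i, '20130101'),
       DATEADD(second, i + 3600, '20130101'),
       'row ' + CONVERT(varchar(10), i),
       REPLICATE('x', 4000)
FROM n;

-- Copy to t2 with the offset precomputed.
INSERT INTO dbo.t2 (CreatedAt, IssuedAt, Col1, Col2, IssuedAtOffset)
SELECT CreatedAt, IssuedAt, Col1, Col2,
       DATEDIFF(second, CreatedAt, IssuedAt)
FROM dbo.t1;
```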

I created a view to make dealing with T2 simpler:

create view [dbo].[t2v]
as
select createdat, dateadd(second, issuedatoffset, createdat) as 'IssuedAtO', col1, col2
from t2

So now it was time to confirm the performance changes:

On the old table:

select col2 from t1 where createdat < '2013-01-30'

gives a CI seek on IX_Clustered, as expected.

select col2 from t1 where issuedat < '2013-01-30'

uses the NCI and then a key lookup, as expected.

On the new table:

select col2 from t2v where createdat < '2013-01-30'

uses a clustered index seek, as expected.

CPU: 0, reads: 3, Duration: 0

Now, the fun part:

select col2 from t2v where IssuedAtO < '2013-01-30'

uses a clustered index SCAN.

CPU: 62, reads: 61094, Duration: 61

I was expecting higher CPU due to the dateadd function, and I'm glad it's using the CI, as there are no bookmark lookups to get to the Col2 data.

But I wasn't expecting a scan – the reads have gone through the roof, presumably because this is a scan and not a seek.

This may negate the anticipated benefits of this approach – why has it changed to a scan, and is there any way I could construct the CI so that it's a seek instead?
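The scan happens because the view's predicate is really `dateadd(second, issuedatoffset, createdat) < '2013-01-30'` – an expression computed from two columns, which the optimizer cannot convert into a seek range on the (CreatedAt, IssuedAtOffset) key, so it must evaluate it for every row. One workaround (my suggestion, exploiting the stated business rule that IssuedAt is always within about a day of CreatedAt) is to add a redundant but sargable bound on the leading key column:

```sql
-- The extra CreatedAt predicate is logically redundant given the business
-- rule, but it IS sargable, so the optimizer can seek a CI range and apply
-- the dateadd() filter only to the rows inside it.
SELECT col2
FROM t2v
WHERE IssuedAtO < '2013-01-30'
  AND createdat < DATEADD(day, 2, '2013-01-30');  -- conservative upper bound
```

An alternative would be a persisted computed column for IssuedAt with its own index, though that reintroduces the second index you were trying to eliminate.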

Thank you.

Best Answer

There is no substitute for benchmarking. To answer the question, I would create and populate several possible tables. Then I would expose these tables to your typical workload, and benchmark.
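A minimal harness for such a comparison, using the asker's own test tables (a sketch; DBCC DROPCLEANBUFFERS empties the buffer pool, so only run it on a test server):

```sql
-- Report logical reads and CPU/elapsed time for each query.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

DBCC DROPCLEANBUFFERS;  -- cold-cache run; test servers only
SELECT col2 FROM t1 WHERE issuedat < '2013-01-30';

DBCC DROPCLEANBUFFERS;
SELECT col2 FROM t2v WHERE IssuedAtO < '2013-01-30';
```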

Including additional columns in your NCIs will slow down modifications and speed up selects. Based on the frequency of both, you can choose which approach uses fewer resources. If a row is on average read twice per year, your conclusions might be different than if every row is on average read twice per minute.

Besides, not all queries are born equal. If some queries must complete within a certain time no matter what, then you must make sure those requirements are met. Obviously such requirements trump the common-good approach described above. Only you know the actual requirements.