SQL Server – Best way to index a table with two datetime columns that are equally queried

database-design, sql-server-2008-r2

SQL Server 2008 R2.

I have a table with ~70 million records and about 10 inserts per second.

It's currently clustered on a CreatedAt datetime column, which always increases. 50% of queries involve this column.

There is another datetime column, IssuedAt, which has a different meaning but is generally within a day or so of CreatedAt. The other 50% of queries involve this column, which has a nonclustered (NC) index on it.

There are a number of other FK columns – the row is probably about 150 bytes wide, if that's relevant – and a few more indexes; the table is heavily queried for a variety of reports in different ways.

My question is: what is the best way to index this table with regard to the clustered index and the two datetime columns?

a) I'm concerned the clustered key is bigger than it has to be (datetime + implicit uniquifier), leading to bigger NC indexes. Should I add an INT identity and cluster on that instead?
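If you go that route, a minimal sketch might look like the following. All table and index names here are placeholders (the question doesn't give the real ones), and dropping/recreating a clustered index rebuilds every NC index, which is expensive on a 70m-row table:

```sql
-- Sketch only: dbo.MyTable and all index names are placeholders.
-- A narrow, ever-increasing surrogate key as the clustering key.
ALTER TABLE dbo.MyTable ADD Id int IDENTITY(1,1) NOT NULL;

-- Recreating the clustered index rebuilds all NC indexes too,
-- so schedule this carefully on a large table.
DROP INDEX IX_MyTable_CreatedAt_CI ON dbo.MyTable;
CREATE UNIQUE CLUSTERED INDEX IX_MyTable_Id ON dbo.MyTable (Id);

-- CreatedAt then needs its own NC index to keep serving its 50% of queries.
CREATE NONCLUSTERED INDEX IX_MyTable_CreatedAt ON dbo.MyTable (CreatedAt);
```

The trade-off: every NC index row then carries a 4-byte int instead of an 8-byte datetime plus 4-byte uniquifier, but queries on CreatedAt start paying bookmark lookups too unless that index covers them.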

b) My queries against the IssuedAt column can be expensive due to bookmark lookups. I'm faced with INCLUDE-ing more and more columns in that index (hurting write performance). Is there an alternative tactic here?
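For reference, the covering-index tactic being weighed here looks like this; the table and column names are placeholders, since the real ones aren't given:

```sql
-- Placeholder names. INCLUDE-d columns live only in the leaf level,
-- so they don't widen the key or change its sort order, but every
-- insert/update of those columns must still maintain this index.
CREATE NONCLUSTERED INDEX IX_MyTable_IssuedAt_Covering
ON dbo.MyTable (IssuedAt)
INCLUDE (SomeFkId, SomeAmount);
```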

Thanks in advance.

UPDATE:

Just for clarity – I'm aware of the need for benchmarking, and I can see that some of the queries are not performing satisfactorily.

There is an inherent tension between 50% of the queries relying on datetime column A with a CI, and 50% on datetime column B with an NCI. I was hoping there might be some sort of approach/trick worth considering to offset this tension – for example a new filtered index, or moving to a clustered index on the date component of both columns when they match, or some other technique along those lines that others use to relieve this tension.
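One concrete variant of the filtered-index idea, if the expensive reports mostly hit a recent window (an assumption – the question doesn't say), is to index only those rows. Note that in SQL Server 2008 R2 a filtered index predicate must compare a column to constants, so you cannot filter on a comparison between IssuedAt and CreatedAt:

```sql
-- Hypothetical: index only the recent rows the reports actually touch.
-- The WHERE clause must be column-vs-constant (no column-to-column
-- predicates), and the constant needs periodic maintenance as time moves on.
CREATE NONCLUSTERED INDEX IX_MyTable_IssuedAt_Recent
ON dbo.MyTable (IssuedAt)
WHERE IssuedAt >= '20130101';
```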

Second Update:

I'm now considering the following:

  1. Create a new column, IssuedAtOffset (int): the difference between CreatedAt and IssuedAt, in seconds. I know from business realities that an int is sufficient to capture this delta, and IssuedAt's milliseconds are irrelevant, so that will work.

  2. Update IssuedAtOffset to the correct values.

  3. Drop IssuedAt.

  4. Rebuild the CI as: CreatedAt + IssuedAtOffset.

  5. Create a view where IssuedAt is returned as a calculated column based on CreatedAt + IssuedAtOffset.

Thus any query against either CreatedAt or IssuedAt can be served directly from the CI when I go through that view. It will mean more CPU, since the addition has to happen for every row, but I'm guessing that's trivial vs the IO savings (still to be benchmarked, of course).
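Steps 1–5 could be sketched as follows. The table and index names are placeholders, and on a 70m-row table step 2's UPDATE would need to be batched:

```sql
-- 1. Add the offset column (seconds between CreatedAt and IssuedAt).
ALTER TABLE dbo.MyTable ADD IssuedAtOffset int NULL;
GO

-- 2. Backfill (batch this in production; shown as one statement for brevity).
UPDATE dbo.MyTable
SET IssuedAtOffset = DATEDIFF(second, CreatedAt, IssuedAt);
GO

-- 3 & 4. Drop IssuedAt and its NCI, then re-cluster on both columns.
-- Dropping and recreating the CI means two full rebuilds of the table.
DROP INDEX IX_MyTable_IssuedAt ON dbo.MyTable;
ALTER TABLE dbo.MyTable DROP COLUMN IssuedAt;
DROP INDEX IX_MyTable_CreatedAt_CI ON dbo.MyTable;
CREATE CLUSTERED INDEX IX_MyTable_CI
    ON dbo.MyTable (CreatedAt, IssuedAtOffset);
GO

-- 5. Expose IssuedAt as a calculated column via a view.
CREATE VIEW dbo.MyTableV
AS
SELECT CreatedAt,
       DATEADD(second, IssuedAtOffset, CreatedAt) AS IssuedAt
       -- , plus the other report columns
FROM dbo.MyTable;
GO
```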

Have I missed any downsides/problems here?

Third update:

I created two tables to test this approach above, and the results are not quite what I expected.

T1 represents the current setup:

CREATE TABLE [dbo].[t1](
    [CreatedAt] [datetime] NULL,
    [IssuedAt] [datetime] NULL,
    [Col1] [varchar](20) NULL,
    [Col2] [char](4000) NULL
) ON [PRIMARY]

CI on one column:

CREATE CLUSTERED INDEX [IX_Clustered] ON [dbo].[t1]
(
    [CreatedAt] ASC
)

NCI on the second datetime column:

CREATE NONCLUSTERED INDEX [IX_t1_NC] ON [dbo].[t1]
(
    [IssuedAt] ASC
)

T2 represents the proposed setup:

CREATE TABLE [dbo].[t2](
    [CreatedAt] [datetime] NULL,
    [IssuedAt] [datetime] NULL,  -- this would be dropped in due course
    [Col1] [varchar](20) NULL,
    [Col2] [char](4000) NULL,
    [IssuedAtOffset] [int] NOT NULL
) ON [PRIMARY]

A single CI on both columns:

CREATE CLUSTERED INDEX [IX_t2_CI] ON [dbo].[t2]
(
    [CreatedAt] ASC,
    [IssuedAtOffset] ASC
)

I inserted 100k records into t1, copied them to t2, and set IssuedAtOffset to the correct value in t2.
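For anyone reproducing this, the test rows can be generated with something like the following (my own sketch, not the exact script used; the start date and the fixed one-hour offset are arbitrary):

```sql
-- Generate 100k rows for t1 with IssuedAt one hour after CreatedAt.
;WITH n AS (
    SELECT TOP (100000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO dbo.t1 (CreatedAt, IssuedAt, Col1, Col2)
SELECT DATEADD(second, i, '20130101'),
       DATEADD(second, i + 3600, '20130101'),
       'row ' + CONVERT(varchar(10), i),
       REPLICATE('x', 4000)
FROM n;

-- Copy to t2 with the offset precomputed.
INSERT INTO dbo.t2 (CreatedAt, IssuedAt, Col1, Col2, IssuedAtOffset)
SELECT CreatedAt, IssuedAt, Col1, Col2,
       DATEDIFF(second, CreatedAt, IssuedAt)
FROM dbo.t1;
```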

I created a view to make dealing with T2 simpler:

create view [dbo].[t2v]
as
select createdat, dateadd(second, issuedatoffset, createdat) as 'IssuedAtO', col1, col2
from t2

So now it was time to confirm the performance changes:

On the old table:

select col2 from t1 where createdat < '2013-01-30'

gives a CI seek on IX_Clustered, as expected.

select col2 from t1 where issuedat < '2013-01-30'

uses the NCI and then a key lookup, as expected.

On the new table:

select col2 from t2v where createdat < '2013-01-30'

uses a clustered index seek, as expected.

CPU: 0, reads: 3, Duration: 0

Now, the fun part:

select col2 from t2v where IssuedAtO < '2013-01-30'

uses a clustered index SCAN.

CPU: 62, reads: 61094, Duration: 61

I was expecting higher CPU due to the dateadd function, and I'm glad it's using the CI, as there are no bookmark lookups to get to the Col2 data.

But I wasn't expecting a scan – the reads have gone through the roof, presumably because this is a scan and not a seek.

This may negate the anticipated benefits of this approach – why has it changed to a scan, and is there any way I could construct the CI so that it's a seek instead?
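The scan happens because the view's predicate is really `dateadd(second, issuedatoffset, createdat) < '2013-01-30'` – an expression computed from two columns, which the optimizer cannot convert into a seek range on the (CreatedAt, IssuedAtOffset) key, so it must evaluate it for every row. One workaround (my suggestion, exploiting the stated business rule that IssuedAt is always within about a day of CreatedAt) is to add a redundant but sargable bound on the leading key column:

```sql
-- The extra CreatedAt predicate is logically redundant given the business
-- rule, but it IS sargable, so the optimizer can seek a CI range and apply
-- the dateadd() filter only to the rows inside it.
SELECT col2
FROM t2v
WHERE IssuedAtO < '2013-01-30'
  AND createdat < DATEADD(day, 2, '2013-01-30');  -- conservative upper bound
```

An alternative would be a persisted computed column for IssuedAt with its own index, though that reintroduces the second index you were trying to eliminate.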

Thank you.

Best Answer

There is no substitute for benchmarking. To answer the question, I would create and populate several possible tables. Then I would expose these tables to your typical workload, and benchmark.
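A minimal harness for such a comparison, using the asker's own test tables (a sketch; DBCC DROPCLEANBUFFERS empties the buffer pool, so only run it on a test server):

```sql
-- Report logical reads and CPU/elapsed time for each query.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

DBCC DROPCLEANBUFFERS;  -- cold-cache run; test servers only
SELECT col2 FROM t1 WHERE issuedat < '2013-01-30';

DBCC DROPCLEANBUFFERS;
SELECT col2 FROM t2v WHERE IssuedAtO < '2013-01-30';
```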

Including additional columns in your NCIs will slow down modifications and speed up selects. Based on the frequency of both, you can choose which approach uses fewer resources. If a row is on average read twice per year, your conclusions might be different than if every row is on average read twice per minute.

Besides, not all queries are born equal. If some queries must complete within a certain time no matter what, then you must make sure those requirements are met. Obviously such requirements trump the common-good approach described above. Only you know the actual requirements.