Sql-server – Which columns (and order) to choose for a clustered index to maximize query performance

clustered-index, sql-server, t-sql

In our database we have a table that looks more or less like the example below. In the past we always created the clustered index on the ID column on the table.

CREATE TABLE [Measurement]
(
    [ID] INT NOT NULL PRIMARY KEY,
    [ParameterID] INT NOT NULL,
    [Measuretime] DATETIME NOT NULL,
    [Value] FLOAT NOT NULL,

    CONSTRAINT [FK_Measurement_Parameter]
        FOREIGN KEY ([ParameterID]) REFERENCES [Parameter]([ID])
)

CREATE INDEX [IX_Measurement_Measuretime_ParameterID]
    ON [Measurement] ([Measuretime]) INCLUDE ([ParameterID]);

CREATE INDEX [IX_Measurement_ParameterID_Measuretime]
    ON [Measurement] ([ParameterID]) INCLUDE ([Measuretime]);

Our data gets written in 1-5 second intervals with successive timestamps for all Parameters.

We decided that it would probably be a better idea to create the clustered index on ParameterID and/or Measuretime, as most queries filter on those two columns.

Here are some examples of what most of our queries look like:

Example A

SELECT *
FROM Measurement
WHERE ParameterID = 1
    and Measuretime between '2015-01-24' and '2015-01-25'

Example B

SELECT ParameterID, cast(Measuretime as date), avg(value)
FROM Measurement
WHERE ParameterID = 1
    and Measuretime between '2015-01-01' and '2015-02-01'
GROUP BY ParameterID, cast(Measuretime as date)

Example C

SELECT DISTINCT
    ParameterID,
    FIRST_VALUE(cast(Measuretime as date))
          OVER (PARTITION BY cast(Measuretime as date), ParameterID
                ORDER BY Measuretime ) Measuredate,
    PERCENTILE_CONT(.25)
         WITHIN GROUP(ORDER BY Value)
         OVER (PARTITION BY cast(Measuretime as date), ParameterID) as [q1],
    ...
FROM Measurement
WHERE Measuretime between '2015-01-01' and '2015-02-01'
    -- and ParameterID = 1
ORDER BY Measuretime, ParameterID

Which of the three index options that come to my mind is best suited for such a scenario?

1. CREATE CLUSTERED INDEX [CIX_Measurement] ON [Measurement]([Measuretime],[ParameterID]), as this is also the order in which data gets written, and both columns are queried.
2. CREATE CLUSTERED INDEX [CIX_Measurement] ON [Measurement]([ParameterID],[Measuretime]), as nearly all our queries filter by ParameterID in one way or another and only afterwards by Measuretime.
3. A clustered index on only one of those two columns, and a normal INDEX for the other column.

Best Answer

As a general rule of thumb, you want to order the columns in the index based on cardinality. That is, the column with the most distinct values first, then the one with the second most, and so on. So you need to answer whether ParameterID or Measuretime will have the fewest duplicate values. Example: if Measuretime has fewer duplicate values than ParameterID, use option 1. Then, for the index to be used for seeks, your queries need to filter on its leading column; the order in which the predicates are written in the WHERE clause does not matter.
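To answer the cardinality question with numbers rather than guesses, a quick count of distinct values per column is usually enough. This is only a sketch against the Measurement table from the question; on a very large table it scans everything, so you may want to run it on a copy or a sample.

```sql
-- Compare how many distinct values each candidate key column has.
-- COUNT_BIG is used so the counts cannot overflow INT on a very large table.
SELECT
    COUNT_BIG(*)                    AS TotalRows,
    COUNT_BIG(DISTINCT ParameterID) AS DistinctParameterIDs,
    COUNT_BIG(DISTINCT Measuretime) AS DistinctMeasuretimes
FROM Measurement;
```

The column whose distinct count is closer to TotalRows is the more unique one.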

Something to consider, however, is the pattern of writes to this table. Is the data inserted in order of Measuretime? Will a new ParameterID always be higher than the ones added before it? Are there lots of updates and deletes on this table?

The advantage of having a surrogate key, like your ID column above, is that it ensures ordered inserts into the table/clustered index. Among other benefits, this avoids page splits on inserts.
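If you want to see how much a given clustered key actually fragments under your write pattern, you can check the index after a representative load. The sketch below assumes the table lives in the dbo schema of the current database, which the question does not state.

```sql
-- Report fragmentation for every index on the table.
-- Assumption: the table is dbo.Measurement in the current database.
SELECT
    i.name                          AS IndexName,
    ps.index_type_desc              AS IndexType,
    ps.avg_fragmentation_in_percent AS FragmentationPercent,
    ps.page_count                   AS PageCount
FROM sys.dm_db_index_physical_stats(
         DB_ID(),                        -- current database
         OBJECT_ID(N'dbo.Measurement'),  -- table to inspect
         NULL,                           -- all indexes
         NULL,                           -- all partitions
         'LIMITED') AS ps                -- cheapest scan mode
JOIN sys.indexes AS i
    ON  i.object_id = ps.object_id
    AND i.index_id  = ps.index_id;
```

Comparing these numbers for the ID-based and the (ParameterID, Measuretime)-based clustered index shows whether page splits are a real concern for your workload.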

Also, if the combination of ParameterID and Measuretime will always be unique, you could consider a composite primary key. This has its downsides, but is a valid option.

Here's a good explanation of using composite primary keys: http://weblogs.sqlteam.com/jeffs/archive/2007/08/23/composite_primary_keys.aspx
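For illustration, a composite primary key version of the table could look roughly like this. It assumes that (ParameterID, Measuretime) really is unique and drops the surrogate ID column; the constraint name PK_Measurement and the column order are my choices, not something given in the question.

```sql
-- Sketch: composite clustered primary key instead of the surrogate ID.
-- Assumption: no two rows share the same ParameterID and Measuretime.
CREATE TABLE [Measurement]
(
    [ParameterID] INT NOT NULL,
    [Measuretime] DATETIME NOT NULL,
    [Value] FLOAT NOT NULL,

    CONSTRAINT [PK_Measurement]
        PRIMARY KEY CLUSTERED ([ParameterID], [Measuretime]),

    CONSTRAINT [FK_Measurement_Parameter]
        FOREIGN KEY ([ParameterID]) REFERENCES [Parameter]([ID])
);
```

With this layout, a query like Example A can seek directly to one ParameterID and read its time range in key order.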

Related Solutions

Sql-server – Large Fact table and partitioning key dilemma

I hate to state the obvious, but I'd say test it: run your production queries against all 3 scenarios (scenario 1 being the current non-partitioned one). I say this because I don't know what your code is querying. Do your queries actually benefit from the base table being sorted by the identity column rather than by the other column? For example, do they often look up rows by ID? If you're not sure, you might be surprised by some performance benefits. The general idea of using ID as the clustered key was for range scans, but in our case we scan by customerID and Date, so it worked out perfectly for us, and perhaps it will for you. Check out this article by Kim Tripp: http://www.sqlskills.com/blogs/kimberly/post/The-Clustered-Index-Debate-Continues.aspx
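One simple way to compare the scenarios is to run the same query against each variant with I/O and timing statistics switched on. The table and column names below (dbo.FactSales, CustomerID, SaleDate, Amount) are hypothetical stand-ins for your fact table.

```sql
-- Show logical reads and CPU/elapsed time for each statement in this session.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Hypothetical production-style query; replace with one of your own.
SELECT CustomerID, SUM(Amount) AS TotalAmount
FROM dbo.FactSales
WHERE CustomerID = 42
  AND SaleDate >= '20150101'
  AND SaleDate <  '20150201'
GROUP BY CustomerID;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The logical reads reported in the Messages output are usually the most stable number to compare between scenarios, since elapsed time varies with caching and server load.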

"What is often cited as the “reason” for IDENTITY PRIMARY KEY clustered index definitions is its monotonic nature, thus minimizing page splits. However, I argue that this is the only “reason” for defining the clustered index as such, and is the poorest reason in the list. Page Splits are managed by proper FILLFACTOR not increasing INSERTS. Range Scans are the most important “reason” when evaluating clustered index key definitions and IDENTITies do not solve this problem. Moreover, although clustering the IDENTITY surrogate key will minimize page splits and logical fragmentation due to its monotonic nature, it will not reduce EXTENT FRAGMENTATION, which can cause just as problematic query performance as page splitting. In short, the argument runs shallow"

Typically you want your clustered index to be as narrow as possible, unique, and non-nullable, as it gets carried into all other indexes. I have a table with roughly 10 billion rows and we partition on the datetime column, which has worked out great.
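As a rough sketch of what partitioning on a datetime column involves: you define a partition function with the boundary dates, map it to filegroups with a partition scheme, and build the clustered index on that scheme. All names here (pf_ByMonth, ps_ByMonth, dbo.FactSales, SaleDate, CustomerID) and the monthly boundaries are hypothetical, and everything is placed on the PRIMARY filegroup for simplicity.

```sql
-- Monthly boundaries; RANGE RIGHT puts each boundary date into the partition to its right.
CREATE PARTITION FUNCTION pf_ByMonth (DATETIME)
AS RANGE RIGHT FOR VALUES ('20150101', '20150201', '20150301');

-- Map every partition to the PRIMARY filegroup (an assumption for brevity).
CREATE PARTITION SCHEME ps_ByMonth
AS PARTITION pf_ByMonth ALL TO ([PRIMARY]);

-- Build the clustered index on the scheme, partitioned by the datetime column.
CREATE CLUSTERED INDEX CIX_FactSales
    ON dbo.FactSales (SaleDate, CustomerID)
    ON ps_ByMonth (SaleDate);
```

Queries that filter on the partitioning column can then skip whole partitions, which is where most of the benefit on very large tables tends to come from.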

SQL Server Query Performance – Optimizing Group By with Many Columns

The non-clustered index you have tested is not the best for this query. It can be used for the WHERE clause and for doing an index scan instead of a full table scan but it cannot be used for the GROUP BY.

The best possible index would be a filtered index (called a partial index in some other database systems) that excludes the rows rejected by the WHERE clause, has all the columns used in the GROUP BY as key columns, and INCLUDEs all the other columns used in the SELECT:

CREATE INDEX special_ix 
  ON dbo.Commissions_Output
    ( company, location, account, 
      salesroute, employee, producttype, 
      item, loadjdate, commissionrate ) 
INCLUDE 
  ( [Extended Sales Price], [Delivered Qty] ) 
WHERE 
  ( [Extended Sales Price] <> 0 ) ;
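For reference, a query of roughly the following shape is what such an index is aimed at. The original query is not shown here, so the SUM aggregates and their aliases are assumptions; only the table, the column names and the filter come from the index definition above.

```sql
-- Hypothetical query shape this index targets.
-- The WHERE predicate matches the index filter, so the filtered index is eligible.
SELECT
    company, location, account,
    salesroute, employee, producttype,
    item, loadjdate, commissionrate,
    SUM([Extended Sales Price]) AS TotalSales,   -- assumed aggregate
    SUM([Delivered Qty])        AS TotalQty      -- assumed aggregate
FROM dbo.Commissions_Output
WHERE [Extended Sales Price] <> 0
GROUP BY
    company, location, account,
    salesroute, employee, producttype,
    item, loadjdate, commissionrate;
```

Because the index keys are in the same order as the grouping columns, the rows arrive already sorted for the GROUP BY, and the included columns mean the base table does not need to be touched.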
