We have a table that is clustered on identity/datetime2. It is partitioned on the same datetime2. Are there any reasons to cluster on datetime2/identity instead? I understand the reasoning behind clustering generally, but with partitioning included, do things change?
Sql-server – Partition Function vs. Clustered Index
Tags: index, partitioning, sql-server, sql-server-2008-r2
Related Solutions
A partitioned table is really more like a collection of individual tables stitched together. So in your example of clustering by IncidentKey and partitioning by IncidentDate, say that the partitioning function splits the table into two partitions so that 1/1/2010 is in partition 1 and 7/1/2010 is in partition 2. The data will be laid out on disk as:
Partition 1:
IncidentKey Date
ABC123 1/1/2010
ABC123 1/1/2011
XYZ999 1/1/2010
Partition 2:
IncidentKey Date
ABC123 7/1/2010
XYZ999 7/1/2010
At a low level there really are two distinct rowsets. It is the query processor that gives the illusion of a single table by creating plans that seek, scan and update all rowsets together, as one.
Any row in any non-clustered index will carry the clustered index key to which it corresponds, say ABC123,7/1/2010. Since the clustered index key always contains the partitioning key column, the engine will always know in what partition (rowset) of the clustered index to search for this value (in this case, in partition 2).
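The layout described above can be reproduced with a small script. This is only a sketch: the names pfIncidentDate, psIncidentDate and dbo.Incident, the column types, and the single boundary value are assumptions chosen to match the example, and table partitioning requires an edition that supports it (Enterprise on SQL Server 2008 R2).

```sql
-- Hypothetical objects; one boundary so that 1/1/2010 falls in partition 1
-- and 7/1/2010 falls in partition 2 (RANGE RIGHT: the boundary value belongs
-- to the partition on its right).
CREATE PARTITION FUNCTION pfIncidentDate (date)
AS RANGE RIGHT FOR VALUES ('20100701');

CREATE PARTITION SCHEME psIncidentDate
AS PARTITION pfIncidentDate ALL TO ([PRIMARY]);

-- Clustered on (IncidentKey, IncidentDate), partitioned on IncidentDate.
CREATE TABLE dbo.Incident
(
    IncidentKey  char(6) NOT NULL,
    IncidentDate date    NOT NULL,
    CONSTRAINT PK_Incident
        PRIMARY KEY CLUSTERED (IncidentKey, IncidentDate)
) ON psIncidentDate (IncidentDate);

INSERT dbo.Incident (IncidentKey, IncidentDate)
VALUES ('ABC123', '20100101'), ('XYZ999', '20100101'),
       ('ABC123', '20100701'), ('XYZ999', '20100701');

-- Which partition does each row map to?
SELECT IncidentKey,
       IncidentDate,
       $PARTITION.pfIncidentDate(IncidentDate) AS PartitionNumber
FROM dbo.Incident
ORDER BY PartitionNumber, IncidentKey;

-- The clustered index (index_id = 1) has one rowset per partition.
SELECT partition_number, rows
FROM sys.partitions
WHERE [object_id] = OBJECT_ID(N'dbo.Incident')
  AND index_id = 1;
```

Within each partition the rows are ordered by IncidentKey first, which is why a query on a date range inside one partition cannot seek on the date alone.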
Now whenever you're dealing with partitioning you must consider if your NC indexes will be aligned (NC index is partitioned exactly the same as the clustered index) or non-aligned (NC index is non-partitioned, or partitioned differently from clustered index). Non-aligned indexes are more flexible, but they have some drawbacks:
- non-aligned indexes require large amounts of memory for certain query plans
- non-aligned indexes prevent efficient partition switch operations
Using aligned indexes solves these issues, but brings its own set of problems, because this physical storage design option ripples into the data model:
- aligned indexes mean unique constraints can no longer be created/enforced unless they include the partitioning column
- all foreign keys referencing the partitioned table must include the partitioning key in the relation (since the partitioning key is, due to alignment, in every index), and this in turn requires that all tables referencing the partitioned table contain the partitioning key column value. Think Orders->OrderDetails: if Orders has OrderID but is partitioned by OrderDate, then OrderDetails must contain not only OrderID, but also OrderDate, in order to properly declare the foreign key constraint.
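The Orders->OrderDetails ripple effect might look like the following. This is a sketch under assumed names and types (pfOrderDate, psOrderDate, the LineNumber column, yearly boundaries); none of them come from the original question.

```sql
-- Hypothetical partitioning objects for the Orders example.
CREATE PARTITION FUNCTION pfOrderDate (datetime2(0))
AS RANGE RIGHT FOR VALUES ('20100101', '20110101');

CREATE PARTITION SCHEME psOrderDate
AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

-- To stay aligned, the primary key must contain the partitioning column,
-- so it becomes (OrderID, OrderDate) instead of OrderID alone.
CREATE TABLE dbo.Orders
(
    OrderID   int          NOT NULL,
    OrderDate datetime2(0) NOT NULL,
    CONSTRAINT PK_Orders
        PRIMARY KEY CLUSTERED (OrderID, OrderDate)
) ON psOrderDate (OrderDate);

-- The child table now has to carry OrderDate as well, purely so that the
-- foreign key can reference the two-column key of dbo.Orders.
CREATE TABLE dbo.OrderDetails
(
    OrderID    int          NOT NULL,
    OrderDate  datetime2(0) NOT NULL,
    LineNumber int          NOT NULL,
    CONSTRAINT PK_OrderDetails
        PRIMARY KEY CLUSTERED (OrderID, OrderDate, LineNumber),
    CONSTRAINT FK_OrderDetails_Orders
        FOREIGN KEY (OrderID, OrderDate)
        REFERENCES dbo.Orders (OrderID, OrderDate)
);
```

Note that nothing here stops two orders from sharing an OrderID on different dates; uniqueness of OrderID alone is exactly what the aligned design gives up.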
I have seldom found these effects called out at the beginning of a project that deploys partitioning, but they exist and have serious consequences.
If you think aligned indexes are a rare or extreme case, then consider this: in many cases the cornerstone of ETL and partitioning solutions is the fast switch in of staging tables. Switch in operations require aligned indexes.
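A switch-in of a staging table could be sketched roughly as follows. All object names (pfSaleDate, psSaleDate, dbo.Sales, dbo.Sales_Staging) and the monthly boundaries are assumptions for illustration; the staging table mirrors the target's columns and indexes, sits on the same filegroup as the target partition, and carries a CHECK constraint that confines its rows to that partition's range.

```sql
-- Hypothetical monthly partitioning: partition 2 holds January 2010.
CREATE PARTITION FUNCTION pfSaleDate (date)
AS RANGE RIGHT FOR VALUES ('20100101', '20100201');

CREATE PARTITION SCHEME psSaleDate
AS PARTITION pfSaleDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.Sales
(
    SaleID   int   NOT NULL,
    SaleDate date  NOT NULL,
    Amount   money NOT NULL,
    CONSTRAINT PK_Sales PRIMARY KEY CLUSTERED (SaleDate, SaleID)
) ON psSaleDate (SaleDate);

-- Aligned nonclustered index: created on the same partition scheme.
CREATE NONCLUSTERED INDEX IX_Sales_Amount
    ON dbo.Sales (Amount)
    ON psSaleDate (SaleDate);

-- Staging table with matching structure and a range constraint.
CREATE TABLE dbo.Sales_Staging
(
    SaleID   int   NOT NULL,
    SaleDate date  NOT NULL,
    Amount   money NOT NULL,
    CONSTRAINT PK_Sales_Staging PRIMARY KEY CLUSTERED (SaleDate, SaleID),
    CONSTRAINT CK_Sales_Staging_Range
        CHECK (SaleDate >= '20100101' AND SaleDate < '20100201')
) ON [PRIMARY];

CREATE NONCLUSTERED INDEX IX_Sales_Staging_Amount
    ON dbo.Sales_Staging (Amount);

-- Load the staging table (the ETL step), then switch it in.
INSERT dbo.Sales_Staging (SaleID, SaleDate, Amount)
VALUES (1, '20100105', 10.00), (2, '20100112', 25.50);

-- Metadata-only operation; the target partition must be empty.
ALTER TABLE dbo.Sales_Staging
    SWITCH TO dbo.Sales PARTITION 2;
```

If dbo.Sales also had a non-aligned index, the SWITCH statement would be rejected, which is the constraint the paragraph above refers to.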
Oh, one more thing: all my argument about foreign keys and the ripple effect of adding the partitioning column value to other tables applies equally to joins.
Because a foreign key can point to a primary key or a unique constraint, and whoever created that foreign key possibly created it before the primary key existed (or they shifted the FK to point to the Unique index while they changed something else about the primary key). This is easy to repro:
CREATE TABLE dbo.MyTable(MyTableID INT NOT NULL, CONSTRAINT myx UNIQUE(MyTableID));
CREATE TABLE dbo.OtherTable1(ID INT FOREIGN KEY REFERENCES dbo.MyTable(MyTableID));
ALTER TABLE dbo.MyTable ADD CONSTRAINT PKmyx PRIMARY KEY(MyTableID);
CREATE TABLE dbo.OtherTable2(ID INT FOREIGN KEY REFERENCES dbo.MyTable(MyTableID));
In fact, both of these foreign keys will point to the first unique constraint defined on that column (myx).
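One way to check which index each foreign key is actually bound to is to join sys.foreign_keys to sys.indexes through key_index_id. This is a sketch against the dbo.MyTable repro above; the column aliases are illustrative.

```sql
-- For every foreign key that references dbo.MyTable, show the index
-- (primary key or unique constraint) it is bound to.
SELECT fk.name                          AS ForeignKeyName,
       OBJECT_NAME(fk.parent_object_id) AS ReferencingTable,
       i.name                           AS ReferencedIndex,
       i.is_primary_key,
       i.is_unique_constraint
FROM sys.foreign_keys AS fk
INNER JOIN sys.indexes AS i
    ON  i.[object_id] = fk.referenced_object_id
    AND i.index_id    = fk.key_index_id
WHERE fk.referenced_object_id = OBJECT_ID(N'dbo.MyTable');
```

Running this before and after re-creating the foreign keys is a quick way to confirm that they moved to the intended key.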
You can fix the foreign key on the other table by dropping it and re-creating it. You will need to repeat that process for any other tables that point to this column. You can find these easily:
SELECT s.name,t.name,fk.name
FROM sys.foreign_key_columns AS fkc
INNER JOIN sys.foreign_keys AS fk
ON fkc.constraint_object_id = fk.[object_id]
INNER JOIN sys.tables AS t
ON fkc.parent_object_id = t.[object_id]
INNER JOIN sys.schemas AS s
ON t.[schema_id] = s.[schema_id]
INNER JOIN sys.columns AS c1
ON c1.[object_id] = fkc.referenced_object_id
AND c1.column_id = fkc.referenced_column_id
AND c1.name = N'MyTableID'
WHERE fkc.referenced_object_id = OBJECT_ID('dbo.MyTable');
Results:
dbo OtherTable1 FK__OtherTable1__ID__32E0915F
dbo OtherTable2 FK__OtherTable2__ID__35BCFE0A
And even generate a script to drop and re-create them (dropping the redundant unique constraint in the meantime):
DECLARE
@sql1 NVARCHAR(MAX) = N'',
@sql2 NVARCHAR(MAX) = N'ALTER TABLE dbo.MyTable DROP CONSTRAINT myx;',
@sql3 NVARCHAR(MAX) = N'';
SELECT
@sql1 += N'
ALTER TABLE ' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name)
+ ' DROP CONSTRAINT ' + QUOTENAME(fk.name) + ';',
@sql3 += N'
ALTER TABLE ' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name)
+ ' ADD CONSTRAINT ' + QUOTENAME(fk.name) + ' FOREIGN KEY '
+ '(' + QUOTENAME(c2.name) + ') REFERENCES dbo.MyTable(MyTableID);'
FROM sys.foreign_key_columns AS fkc
INNER JOIN sys.foreign_keys AS fk
ON fkc.constraint_object_id = fk.[object_id]
INNER JOIN sys.tables AS t
ON fkc.parent_object_id = t.[object_id]
INNER JOIN sys.schemas AS s
ON t.[schema_id] = s.[schema_id]
INNER JOIN sys.columns AS c1
ON c1.[object_id] = fkc.referenced_object_id
AND c1.column_id = fkc.referenced_column_id
AND c1.name = N'MyTableID'
INNER JOIN sys.columns AS c2
ON c2.[object_id] = fkc.parent_object_id
AND c2.column_id = fkc.parent_column_id
WHERE fkc.referenced_object_id = OBJECT_ID('dbo.MyTable');
PRINT @sql1;
PRINT @sql2;
PRINT @sql3;
-- EXEC sp_executesql @sql1;
-- EXEC sp_executesql @sql2;
-- EXEC sp_executesql @sql3;
Results:
ALTER TABLE [dbo].[OtherTable1] DROP CONSTRAINT [FK__OtherTable1__ID__32E0915F];
ALTER TABLE [dbo].[OtherTable2] DROP CONSTRAINT [FK__OtherTable2__ID__35BCFE0A];
ALTER TABLE dbo.MyTable DROP CONSTRAINT myx;
ALTER TABLE [dbo].[OtherTable1] ADD CONSTRAINT [FK__OtherTable1__ID__32E0915F]
FOREIGN KEY ([ID]) REFERENCES dbo.MyTable(MyTableID);
ALTER TABLE [dbo].[OtherTable2] ADD CONSTRAINT [FK__OtherTable2__ID__35BCFE0A]
FOREIGN KEY ([ID]) REFERENCES dbo.MyTable(MyTableID);
This explicitly handles this case, where the constraint only involves a single column. It gets a little more complex if there are multiple columns involved (and this answer is not meant to solve that problem). I also didn't test if this works exactly as coded if the foreign keys point to a redundant unique index (which has the same underlying structure but is created with slightly different DDL). Exercise for the reader. :-)
Best Answer
Partitioning a table only divides it into "chunks" based on the partition function. The clustered index will give order to the data within each partition.
If you're planning to run queries that involve parts of a partition (i.e., show me sales between Jan 5th and Jan 12th), then it can be advantageous to those queries to have the date as the leading column of the clustering key. This type of structure will result in clustered index seeks, instead of partition scans. (Assuming there are no other suitable indexes on the table.)
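As a rough illustration of the date-first option, the sketch below uses assumed names (pfSaleMonth, psSaleMonth, dbo.SalesByDate) and assumed monthly boundaries; the question does not give the real schema.

```sql
-- Hypothetical monthly partitioning on the datetime2 column.
CREATE PARTITION FUNCTION pfSaleMonth (datetime2(0))
AS RANGE RIGHT FOR VALUES ('20100101', '20100201', '20100301');

CREATE PARTITION SCHEME psSaleMonth
AS PARTITION pfSaleMonth ALL TO ([PRIMARY]);

CREATE TABLE dbo.SalesByDate
(
    SaleID   int IDENTITY(1,1) NOT NULL,
    SaleDate datetime2(0)      NOT NULL,
    Amount   money             NOT NULL
) ON psSaleMonth (SaleDate);

-- Date leads the clustering key; the identity column makes it unique.
CREATE UNIQUE CLUSTERED INDEX CIX_SalesByDate
    ON dbo.SalesByDate (SaleDate, SaleID)
    ON psSaleMonth (SaleDate);

-- Touches only part of the January partition: partition elimination picks
-- the partition, and the leading date column allows a range seek inside it.
SELECT SaleID, SaleDate, Amount
FROM dbo.SalesByDate
WHERE SaleDate >= '20100105'
  AND SaleDate <  '20100113';
```

With the key order reversed to (SaleID, SaleDate), the same query would still be limited to one partition but would have to scan all of it.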
If the queries only touch entire partitions, it doesn't matter as much, as partition elimination is enough to isolate the needed data. That said, ordering by date first could eliminate the need for an expensive sort operation, depending on what you're doing.
But this also depends on which columns you need from the table. If you only need to do something like aggregate total sales amounts over a date range, it may be sufficient to use a covering nonclustered index with the date as the leading (or only) column of the key, instead of rebuilding the entire table just for that.
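The covering-index alternative might be sketched like this, keeping the identity-first clustering and adding a narrow aligned index for the date-range aggregate. The names (pfLogMonth, psLogMonth, dbo.SalesById, IX_SalesById_Date) and boundaries are hypothetical.

```sql
CREATE PARTITION FUNCTION pfLogMonth (datetime2(0))
AS RANGE RIGHT FOR VALUES ('20100101', '20100201', '20100301');

CREATE PARTITION SCHEME psLogMonth
AS PARTITION pfLogMonth ALL TO ([PRIMARY]);

-- Existing design: clustered on identity/datetime2, partitioned on datetime2.
CREATE TABLE dbo.SalesById
(
    SaleID   int IDENTITY(1,1) NOT NULL,
    SaleDate datetime2(0)      NOT NULL,
    Amount   money             NOT NULL,
    CONSTRAINT PK_SalesById PRIMARY KEY CLUSTERED (SaleID, SaleDate)
) ON psLogMonth (SaleDate);

-- Narrow, aligned, covering index: date as the key, Amount included so the
-- aggregate below never has to touch the clustered index.
CREATE NONCLUSTERED INDEX IX_SalesById_Date
    ON dbo.SalesById (SaleDate)
    INCLUDE (Amount)
    ON psLogMonth (SaleDate);

SELECT SUM(Amount) AS TotalSales
FROM dbo.SalesById
WHERE SaleDate >= '20100105'
  AND SaleDate <  '20100113';
```

This leaves singleton lookups by SaleID on the clustered index untouched, at the cost of maintaining one extra index on every insert.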
If you do change the clustered index, this will affect singleton lookups (probably by the identity column which I assume is the primary key) as they will now involve a nonclustered index seek + key lookup. If this type of activity isn't a major part of the workload, this will be fine, but you have to be really careful that not too many rows are selected by these queries, or the optimizer will revert to a partition scan on the assumption that it's cheaper. Again, depending on which columns you need, it may be advantageous to create a covering nonclustered index that includes only the columns you need.