Sql-server – SQL Server 2008 – Partitioning and Clustered Indexes

indexpartitioningsql serversql-server-2008-r2

So let me preface by saying I do not have total control over my db design, so a lot of the aspects of the current system cannot be changed for the purposes of this scenario.

Comments about how we should rethink aspects of the design are likely correct but unhelpful 🙂

I have a very large table, approx 150 fields wide and about 600m rows, that drives a large number of processes. This is in a data warehouse situation so we don't have ANY updates/inserts outside the scheduled load process, so it is heavily indexed.

A decision has been made to try partitioning this table, and I have some concerns about indexing a partitioned table. I don't have any experience with partitioning, so any input or links are appreciated. I couldn't locate specifically what I am after on BOL or msdn.

Currently we cluster on a field that we'll call IncidentKey which is a varchar(50) and not unique – we could have between 1-100 records with the same IK (no comments please). We do often get new data on old IncidentKey records so it's not sequential either.

I understand I need to include my partition field, IncidentDate, in my clustered index key for the partition to work correctly. I'm thinking it would be IncidentKey, IncidentDate.

The question is, how will the mechanics of a clustered index work on a 2 part key in a partitioned table, if a record in a "new" partition should be before a record in an "old" partition in the clustered index?

For example, I have 5 records:

IncidentKey    Date

ABC123        1/1/2010
ABC123        7/1/2010
ABC123        1/1/2011
XYZ999        1/1/2010
XYZ999        7/1/2010

If I get a new record for ABC123, 2/1/2011 it will need to be in the clustered index BEFORE XYZ999, 1/1/2010. How does this work?

I'm assuming fragmentation and pointers, but I can't find any info on the physical storage and configuration of non-partitioned clustered indexes on partitioned tables with dual-part keys.

Best Answer

A partitioned table is really more like a collection of individual tables stitched together. So your in example of clustering by IncidentKey and partition by IncidentDate, say that the partitioning function splits the tables into two partitions so that 1/1/2010 is in partition 1 and 7/1/2010 is partition two. The data will be layed out on disk as:

Partition 1:
IncidentKey    Date
ABC123        1/1/2010
ABC123        1/1/2011
XYZ999        1/1/2010

Partition 2:
IncidentKey    Date
ABC123        7/1/2010
XYZ999        7/1/2010

At a low level there really are two, distinct rowsets. Is the query processor that gives the illusion of a single table by creating plans that seek, scan and update all rowsets together, as one.

Any row in any non-clustered index will have have the clustered index key to which it corresponds, say ABC123,7/1/2010. Since the clustered index key always contains the partitioning key column, the engine will always know in what partition (rowset) of the clustered index to search for this value (in this case, in partition 2).

Now whenever you're dealing with partitioning you must consider if your NC indexes will be aligned (NC index is partitioned exactly the same as the clustered index) or non-aligned (NC index is non-partitioned, or partitioned differently from clustered index). Non-aligned indexes are more flexible, but they have some drawbacks:

Using aligned indexes solves these issues, but brings its own set of problems, because this physical, storage design, option ripples into the data model:

  • aligned indexes mean unique constrains can no longer be created/enforced (except for the partitioning column)
  • all foreign keys referencing the partitioned table must include the partitioning key in the relation (since the partitioning key is, due to alignment, in every index), and this in turn requires that all tables referencing the partitioned table contain partitioning key column value. Think Orders->OrderDetails, if Orders have OrderID but is partitioned by OrderDate, then OrderDetails must contain not only OrderID, but also OrderDate, in order to properly declare the foreign key constraint.

These effects I found seldom called out at the beginning of a project that deploys partitioning, but they exists and have serious consequences.

If you think aligned indexes are a rare or extreme case, then consider this: in many cases the cornerstone of ETL and partitioning solutions is the fast switch in of staging tables. Switch in operations require aligned indexes.

Oh, one more thing: all my argument about foreign keys and the ripple effect of adding the partitioning column value to other tables applies equally to joins.