Sql-server – Large Fact table and partitioning key dilemma

clustered-index nonclustered-index partitioning sql-server-2008-r2

I have a fairly large fact table (2 billion records, approx. 120 GB). This table is not partitioned and queries against it are very slow. I am planning to partition the table and its indexes.
The table has an identity column which is the primary key and has a clustered index on it. There are other non-clustered indexes on it, but I won't go into much detail about them here. The column I want to partition on is not part of the primary key, though it is NOT NULL, and this creates a slight dilemma for me. I have two options.

Option 1: Add this column to the primary key, i.e. make it a composite primary key. Since the first column is an identity, the combination would always be unique, which means I don't have to worry about the applications accessing the table. The clustered index will automatically be partition aligned, and the other indexes can be partition aligned as well.
Option 2: Remove the clustered index on the identity column and make it a unique non-clustered index instead. This index cannot be partition aligned since the partition key is not part of it, and hence it would have to sit on one drive. Then create a clustered index on the partition key column, which can be partition aligned, as can all the other non-clustered indexes.
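
For concreteness, the two options might look roughly like the T-SQL below. This is only a sketch under assumptions: the table names, the columns FactId, FactDate and Amount, the partition function pfFactDate, the scheme psFactDate and the boundary values are all made up for illustration, since the real schema is not shown here.

```sql
-- All object names and boundary values are hypothetical.
CREATE PARTITION FUNCTION pfFactDate (date)
AS RANGE RIGHT FOR VALUES ('20100101', '20110101');

-- A single filegroup keeps the sketch short; a real design may spread partitions out.
CREATE PARTITION SCHEME psFactDate
AS PARTITION pfFactDate ALL TO ([PRIMARY]);
GO

-- Option 1: composite clustered primary key that includes the partitioning column.
CREATE TABLE dbo.FactSales_Option1
(
    FactId   bigint IDENTITY(1,1) NOT NULL,
    FactDate date   NOT NULL,
    Amount   money  NOT NULL,
    CONSTRAINT PK_FactSales_Option1 PRIMARY KEY CLUSTERED (FactId, FactDate)
) ON psFactDate (FactDate);
GO

-- Option 2: primary key stays on FactId alone, as a non-clustered index
-- placed explicitly on one filegroup, so it is not partition aligned.
CREATE TABLE dbo.FactSales_Option2
(
    FactId   bigint IDENTITY(1,1) NOT NULL,
    FactDate date   NOT NULL,
    Amount   money  NOT NULL,
    CONSTRAINT PK_FactSales_Option2 PRIMARY KEY NONCLUSTERED (FactId) ON [PRIMARY]
);

-- The clustered index goes on the partitioning column and is partition aligned.
CREATE CLUSTERED INDEX CIX_FactSales_Option2_FactDate
    ON dbo.FactSales_Option2 (FactDate)
    ON psFactDate (FactDate);
GO
```

In option 2 the explicit ON [PRIMARY] on the primary key matters: without it, the unique index would be placed on the table's partition scheme and rejected, because its key does not contain FactDate.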

Our DBA is in favour of the second option since he doesn't want to change the primary key. I am worried about the performance hit in option 2 since the unique index on the identity column is not partition aligned.

I would appreciate any feedback, plus any other method you would have used in such a situation.

Best Answer

I hate to state the obvious, but I'd say build both options and run your production queries against all 3 scenarios (scenario 1 being the current non-partitioned table). I say this because I don't know what your code is querying. Do your queries actually benefit from the base table being sorted by the identity column rather than by the other column? For example, do they often look up rows by ID? If you're not sure, you might be surprised by some performance benefits. The general idea of using the ID as the clustered key was for range scans, but in our case we scan by customerID and Date, so clustering on those instead worked out perfectly for us, and perhaps it will for you. Check out this article by Kim Tripp: http://www.sqlskills.com/blogs/kimberly/post/The-Clustered-Index-Debate-Continues.aspx

"What is often cited as the “reason” for IDENTITY PRIMARY KEY clustered index definitions is its monotonic nature, thus minimizing page splits. However, I argue that this is the only “reason” for defining the clustered index as such, and is the poorest reason in the list. Page Splits are managed by proper FILLFACTOR not increasing INSERTS. Range Scans are the most important “reason” when evaluating clustered index key definitions and IDENTITies do not solve this problem. Moreover, although clustering the IDENTITY surrogate key will minimize page splits and logical fragmentation due to its monotonic nature, it will not reduce EXTENT FRAGMENTATION, which can cause just as problematic query performance as page splitting. In short, the argument runs shallow"

Typically you want your clustered index key to be as narrow as possible, unique, and non-nullable, as it gets carried into all other indexes. I have a table with roughly 10 billion rows and we partition on the datetime column, which has worked out great.
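
One rough way to run the comparison suggested above is to execute the same representative query against each candidate layout with I/O and timing statistics switched on. The sketch below assumes two hypothetical test copies of the fact table, dbo.FactSales_Option1 and dbo.FactSales_Option2, loaded with the same data, and an invented date-range query; substitute your real production queries.

```sql
-- Hypothetical table names and query; the real workload is not shown in the question.
SET STATISTICS IO ON;    -- reports logical and physical reads per table
SET STATISTICS TIME ON;  -- reports CPU and elapsed time per statement

-- Candidate layout 1
SELECT SUM(Amount) AS TotalAmount
FROM dbo.FactSales_Option1
WHERE FactDate >= '20100101' AND FactDate < '20100201';

-- Candidate layout 2
SELECT SUM(Amount) AS TotalAmount
FROM dbo.FactSales_Option2
WHERE FactDate >= '20100101' AND FactDate < '20100201';

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

Comparing the reads and elapsed times in the Messages output, along with the actual execution plans, should show whether partition elimination is taking place and whether lookups by ID get noticeably slower.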

Related Solutions

Sql-server – SQL Server 2008 – Partitioning and Clustered Indexes

A partitioned table is really more like a collection of individual tables stitched together. So in your example of clustering by IncidentKey and partitioning by IncidentDate, say that the partitioning function splits the table into two partitions so that 1/1/2010 is in partition 1 and 7/1/2010 is in partition 2. The data will be laid out on disk as:

Partition 1:
IncidentKey    Date
ABC123        1/1/2010
ABC123        1/1/2011
XYZ999        1/1/2010

Partition 2:
IncidentKey    Date
ABC123        7/1/2010
XYZ999        7/1/2010

At a low level there really are two distinct rowsets. It is the query processor that gives the illusion of a single table by creating plans that seek, scan and update all rowsets together, as one.

Any row in any non-clustered index will carry the clustered index key to which it corresponds, say ABC123,7/1/2010. Since the clustered index key always contains the partitioning key column, the engine will always know in which partition (rowset) of the clustered index to search for this value (in this case, in partition 2).
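
The mapping from a partitioning key value to a partition can be inspected with the built-in $PARTITION function. The following is a minimal sketch, assuming a hypothetical function pfIncidentDate with a single boundary chosen so that 1/1/2010 and 7/1/2010 land in different partitions, and a hypothetical table dbo.Incident; none of these names come from the original question.

```sql
-- Hypothetical names; one RANGE LEFT boundary gives two partitions.
CREATE PARTITION FUNCTION pfIncidentDate (date)
AS RANGE LEFT FOR VALUES ('20100630');

CREATE PARTITION SCHEME psIncidentDate
AS PARTITION pfIncidentDate ALL TO ([PRIMARY]);
GO

CREATE TABLE dbo.Incident
(
    IncidentKey  char(6) NOT NULL,
    IncidentDate date    NOT NULL,
    CONSTRAINT PK_Incident PRIMARY KEY CLUSTERED (IncidentKey, IncidentDate)
) ON psIncidentDate (IncidentDate);

INSERT INTO dbo.Incident (IncidentKey, IncidentDate)
VALUES ('ABC123', '20100101'), ('XYZ999', '20100101'),
       ('ABC123', '20100701'), ('XYZ999', '20100701');
GO

-- Which partition would a given date be routed to?
SELECT $PARTITION.pfIncidentDate(CAST('20100101' AS date)) AS PartitionForJan1,  -- 1
       $PARTITION.pfIncidentDate(CAST('20100701' AS date)) AS PartitionForJul1;  -- 2

-- How are the existing rows distributed across the partitions?
SELECT $PARTITION.pfIncidentDate(IncidentDate) AS PartitionNumber,
       COUNT(*) AS RowsInPartition
FROM dbo.Incident
GROUP BY $PARTITION.pfIncidentDate(IncidentDate)
ORDER BY PartitionNumber;
```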

Now, whenever you're dealing with partitioning you must consider whether your NC indexes will be aligned (the NC index is partitioned exactly the same as the clustered index) or non-aligned (the NC index is non-partitioned, or partitioned differently from the clustered index). Non-aligned indexes are more flexible, but they have some drawbacks:

- non-aligned indexes require large amounts of memory for certain query plans
- non-aligned indexes prevent efficient partition switch operations

Using aligned indexes solves these issues, but brings its own set of problems, because this physical storage design option ripples into the data model:

- aligned indexes mean unique constraints can no longer be created or enforced, unless the constraint includes the partitioning column
- all foreign keys referencing the partitioned table must include the partitioning key in the relation (since the partitioning key is, due to alignment, in every index), and this in turn requires that all tables referencing the partitioned table contain the partitioning key column value. Think Orders->OrderDetails: if Orders has OrderID but is partitioned by OrderDate, then OrderDetails must contain not only OrderID, but also OrderDate, in order to properly declare the foreign key constraint.

I have found that these effects are seldom called out at the beginning of a project that deploys partitioning, but they exist and have serious consequences.

If you think aligned indexes are a rare or extreme case, then consider this: in many cases the cornerstone of ETL and partitioning solutions is the fast switch-in of staging tables. Switch-in operations require aligned indexes.
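
A switch-in of a staging table might look roughly like the sketch below. All object names and boundary values are hypothetical. The staging table has to match the target's structure and indexes, live on the same filegroup as the target partition, and carry a CHECK constraint proving that its rows fit the range of that partition; the target partition must be empty.

```sql
-- Hypothetical names; partition 2 covers January 2010 with these RANGE RIGHT boundaries.
CREATE PARTITION FUNCTION pfLoadDate (date)
AS RANGE RIGHT FOR VALUES ('20100101', '20100201');

CREATE PARTITION SCHEME psLoadDate
AS PARTITION pfLoadDate ALL TO ([PRIMARY]);
GO

CREATE TABLE dbo.Fact
(
    LoadDate date  NOT NULL,
    Amount   money NOT NULL
) ON psLoadDate (LoadDate);

CREATE CLUSTERED INDEX CIX_Fact
    ON dbo.Fact (LoadDate)
    ON psLoadDate (LoadDate);

-- Staging table: same columns, same clustered index, same filegroup as the target partition.
CREATE TABLE dbo.Fact_Staging
(
    LoadDate date  NOT NULL,
    Amount   money NOT NULL,
    CONSTRAINT CK_Fact_Staging_Range
        CHECK (LoadDate >= '20100101' AND LoadDate < '20100201')
) ON [PRIMARY];

CREATE CLUSTERED INDEX CIX_Fact_Staging
    ON dbo.Fact_Staging (LoadDate)
    ON [PRIMARY];
GO

-- ... bulk load dbo.Fact_Staging here ...

-- Metadata-only operation: the staging rows become partition 2 of dbo.Fact.
ALTER TABLE dbo.Fact_Staging SWITCH TO dbo.Fact PARTITION 2;
```

If dbo.Fact also had a non-aligned index, the SWITCH statement would fail, which is why ETL designs built on switching end up requiring aligned indexes.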

Oh, one more thing: my whole argument about foreign keys and the ripple effect of adding the partitioning column value to other tables applies equally to joins.
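
The Orders->OrderDetails case mentioned earlier can be sketched as follows. The column list, the LineNumber column and the partition function and scheme names are assumptions made for illustration only.

```sql
-- Hypothetical names and boundaries.
CREATE PARTITION FUNCTION pfOrderDate (date)
AS RANGE RIGHT FOR VALUES ('20100101', '20110101');

CREATE PARTITION SCHEME psOrderDate
AS PARTITION pfOrderDate ALL TO ([PRIMARY]);
GO

-- The aligned primary key has to contain the partitioning column OrderDate.
CREATE TABLE dbo.Orders
(
    OrderID   int  NOT NULL,
    OrderDate date NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID, OrderDate)
) ON psOrderDate (OrderDate);

-- The child table must carry OrderDate too, purely so the foreign key
-- can reference the full (OrderID, OrderDate) key of dbo.Orders.
CREATE TABLE dbo.OrderDetails
(
    OrderID    int  NOT NULL,
    OrderDate  date NOT NULL,
    LineNumber int  NOT NULL,
    CONSTRAINT PK_OrderDetails PRIMARY KEY CLUSTERED (OrderID, OrderDate, LineNumber),
    CONSTRAINT FK_OrderDetails_Orders
        FOREIGN KEY (OrderID, OrderDate)
        REFERENCES dbo.Orders (OrderID, OrderDate)
);
GO

-- Joins benefit from the same column: including OrderDate in the join and
-- filter gives the optimizer a predicate on the partitioning column of dbo.Orders.
SELECT o.OrderID, d.LineNumber
FROM dbo.Orders AS o
JOIN dbo.OrderDetails AS d
    ON d.OrderID = o.OrderID
   AND d.OrderDate = o.OrderDate
WHERE o.OrderDate >= '20100101' AND o.OrderDate < '20100201';
```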

Sql-server – Unique Non Clustered Column in Partitioned table

You cannot have a unique constraint backed by an aligned index (or a plain aligned unique non-clustered index) unless you add the partitioning column to the unique expression. So if you have partitioned your table on column [datetime], then your unique constraint (or the unique index) must be ([datetime], [xyz]). Since more often than not this is not acceptable, the alternatives are to:

- remove the unique constraint from the data model (i.e. accept that duplicates can occur)
- keep the non-aligned index, with all its switch-in/switch-out issues and performance problems
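
The three variants can be sketched as below, assuming a hypothetical table dbo.EventLog partitioned on [datetime]; the table, function, scheme and index names are invented for illustration, and the first CREATE INDEX is expected to be rejected.

```sql
-- Hypothetical names and boundaries.
CREATE PARTITION FUNCTION pfEventTime (datetime)
AS RANGE RIGHT FOR VALUES ('20100101', '20110101');

CREATE PARTITION SCHEME psEventTime
AS PARTITION pfEventTime ALL TO ([PRIMARY]);
GO

CREATE TABLE dbo.EventLog
(
    [datetime] datetime NOT NULL,
    xyz        int      NOT NULL
) ON psEventTime ([datetime]);

CREATE CLUSTERED INDEX CIX_EventLog
    ON dbo.EventLog ([datetime])
    ON psEventTime ([datetime]);
GO

-- Rejected: with no ON clause the index is placed on the table's partition
-- scheme (aligned), but the unique key does not contain the partitioning column.
CREATE UNIQUE NONCLUSTERED INDEX UX_EventLog_xyz
    ON dbo.EventLog (xyz);
GO

-- Accepted and aligned, but it only guarantees uniqueness of the pair,
-- so the same xyz can still appear with different [datetime] values.
CREATE UNIQUE NONCLUSTERED INDEX UX_EventLog_datetime_xyz
    ON dbo.EventLog ([datetime], xyz);
GO

-- Accepted and enforces uniqueness of xyz alone, but it is non-aligned
-- because it is placed on a single filegroup.
CREATE UNIQUE NONCLUSTERED INDEX UX_EventLog_xyz_nonaligned
    ON dbo.EventLog (xyz)
    ON [PRIMARY];
GO
```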
