Sql-server – Primary keys, clustered indexes and partitioning

database-design, enterprise-edition, partitioning, sql-server, sql-server-2012

We store data for financial records, splitting it up over 4 tables. I'm at a bit of a loss as to what's optimal for our setup here, or at least which direction to head in as I'm getting a lot of conflicting advice and resistance, and don't have all the facts to decide on ways forward. Below is essentially what all 4 tables look like.

CREATE TABLE [Example].[MasterRecordTable](
    [rowDateTime] [datetime] NOT NULL CONSTRAINT [rowDateTime] DEFAULT (GETDATE()),
    [recordID] [int] IDENTITY(1,1) NOT NULL,
    [date] [date] NOT NULL,
    [field1] ...,
    ...
 CONSTRAINT [PK_MRT] PRIMARY KEY CLUSTERED 
(
    [recordID] ASC,
    [date] ASC
)
)

CREATE TABLE [Example].[ChildTables](
    [recordID] [int] IDENTITY(1,1) NOT NULL,
    [date] [date] NOT NULL,
    [field1] ...,
    ...
 CONSTRAINT [PK_CT] PRIMARY KEY CLUSTERED 
(
    [recordID] ASC,
    [date] ASC
)
)

All tables use a clustered primary key that covers recordID and date, even though at no point while processing do we JOIN on or include date in WHERE clauses, and it's only recordID that needs to be a primary key. To me date should not be present in the primary key or clustered index.
All tables carry the date field. Is this necessary if you consider my question about archiving/partitioning below? The only reason it's in all tables is to assist with our manual archiving process.
We do not make use of foreign keys, and I'd like to know if this is worth reconsidering.
We need an archiving strategy. At the moment we have tables with the same structure as above, with _Archive added to the name and placed in a separate filegroup and hard drive. We then manually move records WHERE date <= @aYearAgo for each table over to its _Archive equivalent, daily. This, as well as developing queries that bridge both tables, is tedious and time-consuming. We're busy evaluating partitioning, and I'd like to have some well-informed answers as to what setup is ideal considering the table structure above. It makes sense to partition and archive based on the date value. We'd like to be able to move older data over to progressively slower hard drives, and then delete it after 5 years.
Would it not be more optimal to have a clustered primary key on recordID alone? Does the partitioning field (date) have to be a part of the clustered index?
The only time we need date is when we report on these tables. Depending on your answers above, I imagine this can be done best with an NC index on the Master table, which is then joined to the Child tables via recordID.

Please let me know if anything is unclear. This is a dump of the concerns off the top of my head, I'm keen to work with someone knowledgeable on indexes, key relationships, and partitioning. Thanks!

Best Answer

at no point while processing do we JOIN on or include date in WHERE clauses

All partitions will need to be touched when the partitioning column is not specified in query predicates. Put yourself in SQL Server's shoes - how would you know which partitions to access (or not) without knowing the partitioning column value? Consider that a singleton index seek without partition elimination will seek against every partition. Compared to a non-partitioned table, this seek overhead will be more noticeable when few rows are returned. There won't be as much difference with full scans, though.
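To make the difference concrete, here is a minimal sketch of two lookups against the question's master table. It assumes the table has already been partitioned on [date] and is clustered on (recordID, [date]); the literal values are made up for illustration.

```sql
-- Assumes Example.MasterRecordTable is partitioned on [date]
-- and clustered on (recordID, [date]). Literal values are illustrative.

-- No predicate on the partitioning column: SQL Server cannot rule out
-- any partition, so the seek on recordID is repeated in every partition.
SELECT recordID, [date]
FROM Example.MasterRecordTable
WHERE recordID = 12345;

-- Predicate on the partitioning column as well: only the partition
-- whose range contains 2013-06-30 needs to be searched.
SELECT recordID, [date]
FROM Example.MasterRecordTable
WHERE recordID = 12345
  AND [date] = '20130630';
```

Comparing the actual execution plans of the two statements should show the difference: the seek operator reports how many partitions it accessed.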

Would it not be more optimal to have a clustered primary key on recordID alone? Does the partitioning field (date) have to be a part of the clustered index?

Yes, a clustered primary key on recordID alone would be more optimal for lookups, but on a partitioned table the partitioning column must be part of the clustered index key, as well as of the key of every unique index that is partitioned. The implication is that partitioned primary key and unique constraints must include the partitioning column as part of the key.
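As a rough sketch of what that looks like in DDL, the following creates a partitioned version of the master table. The names pfRecordDate, psRecordDate and MasterRecordTable_P are hypothetical, the yearly boundaries are illustrative, and the Example schema is assumed to exist already.

```sql
-- Hypothetical names; boundary values are illustrative yearly splits.
CREATE PARTITION FUNCTION pfRecordDate (date)
AS RANGE RIGHT FOR VALUES ('20110101', '20120101', '20130101');

-- Every partition is mapped to PRIMARY only to keep the sketch runnable
-- without extra filegroups.
CREATE PARTITION SCHEME psRecordDate
AS PARTITION pfRecordDate ALL TO ([PRIMARY]);

CREATE TABLE Example.MasterRecordTable_P (
    recordID int IDENTITY(1,1) NOT NULL,
    [date]   date NOT NULL,
    -- The partitioning column has to appear in the key of a
    -- partitioned unique index, hence (recordID, [date]).
    CONSTRAINT PK_MRT_P PRIMARY KEY CLUSTERED (recordID, [date])
) ON psRecordDate ([date]);
```

Declaring the primary key on (recordID) alone while keeping it on psRecordDate would be rejected, because the partitioning column would be missing from the unique index key.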

it's only recordID that needs to be a primary key. To me date should not be present in the primary key or clustered index.

Without knowing your data I can't say whether date should be part of the primary key. Assuming it does not belong there from a data-model perspective, you still cannot partition the primary key index without it. One approach is a nonclustered, non-partitioned primary key on recordID alone, but weigh the performance implications for your queries.
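A minimal sketch of that approach, under assumed names (pfDemoDate, psDemoDate, MasterRecordTable_NC are hypothetical, and the Example schema is assumed to exist): the data is partitioned by date through a clustered index, while the primary key on recordID lives unpartitioned on a single filegroup.

```sql
-- Hypothetical names; boundaries are illustrative.
CREATE PARTITION FUNCTION pfDemoDate (date)
AS RANGE RIGHT FOR VALUES ('20120101', '20130101');

CREATE PARTITION SCHEME psDemoDate
AS PARTITION pfDemoDate ALL TO ([PRIMARY]);

CREATE TABLE Example.MasterRecordTable_NC (
    recordID int IDENTITY(1,1) NOT NULL,
    [date]   date NOT NULL,
    -- Non-partitioned primary key: stored on one filegroup, so it can be
    -- unique on recordID alone without the partitioning column.
    CONSTRAINT PK_MRT_NC PRIMARY KEY NONCLUSTERED (recordID)
        ON [PRIMARY]
) ON psDemoDate ([date]);

-- The clustered index carries the partitioning column and is partitioned.
CREATE CLUSTERED INDEX CIX_MRT_NC
ON Example.MasterRecordTable_NC ([date], recordID)
ON psDemoDate ([date]);
```

The trade-off is that PK_MRT_NC is not aligned with the table, which matters if partition switching is part of the archiving plan.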

Tiered storage for less frequently accessed data is a use case for table partitioning with different filegroups. You will still need to develop a method to either physically move the underlying files or physically move rows from one filegroup to another. Partition SWITCH alone cannot do the job, because the source and target of a switch must reside on the same filegroup.
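As an illustration only, a partition scheme can map date ranges to different filegroups, and the catalog views show where each partition currently lives. FG_Archive, FG_Warm and FG_Current are hypothetical filegroups that would have to exist beforehand; pfTiered and psTiered are likewise made-up names.

```sql
-- FG_Archive, FG_Warm and FG_Current are hypothetical filegroups that
-- must already exist (ALTER DATABASE ... ADD FILEGROUP / ADD FILE).
CREATE PARTITION FUNCTION pfTiered (date)
AS RANGE RIGHT FOR VALUES ('20120101', '20130101');

-- Three partitions: before 2012, during 2012, 2013 onwards.
CREATE PARTITION SCHEME psTiered
AS PARTITION pfTiered TO (FG_Archive, FG_Warm, FG_Current);

-- Which filegroup does each partition of a partitioned table live on?
SELECT p.partition_number,
       fg.name AS filegroup_name,
       p.rows
FROM sys.partitions AS p
JOIN sys.indexes AS i
  ON i.object_id = p.object_id
 AND i.index_id  = p.index_id
JOIN sys.destination_data_spaces AS dds
  ON dds.partition_scheme_id = i.data_space_id
 AND dds.destination_id      = p.partition_number
JOIN sys.filegroups AS fg
  ON fg.data_space_id = dds.data_space_id
WHERE p.object_id = OBJECT_ID(N'Example.MasterRecordTable')
  AND i.index_id IN (0, 1)          -- heap or clustered index only
ORDER BY p.partition_number;
```

The mapping is static: a partition created on FG_Current stays there as its data ages, which is why a separate process for moving data between tiers is still required.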

Related Solutions

Sql-server – SQL Server 2008 – Partitioning and Clustered Indexes

A partitioned table is really more like a collection of individual tables stitched together. So in your example of clustering by IncidentKey and partitioning by IncidentDate, say that the partitioning function splits the table into two partitions so that 1/1/2010 is in partition 1 and 7/1/2010 is in partition 2. The data will be laid out on disk as:

Partition 1:
IncidentKey    Date
ABC123        1/1/2010
ABC123        1/1/2011
XYZ999        1/1/2010

Partition 2:
IncidentKey    Date
ABC123        7/1/2010
XYZ999        7/1/2010

At a low level there really are two distinct rowsets. It is the query processor that gives the illusion of a single table by creating plans that seek, scan and update all rowsets together, as one.

Any row in any non-clustered index will have the clustered index key to which it corresponds, say ABC123,7/1/2010. Since the clustered index key always contains the partitioning key column, the engine will always know in which partition (rowset) of the clustered index to search for this value (in this case, in partition 2).
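The mapping from a key value to a partition can be checked directly with the $PARTITION function. The sketch below assumes a hypothetical partition function pfIncidentDate with a single boundary at 7/1/2010, declared RANGE RIGHT so that the boundary value falls into the second partition as in the example above.

```sql
-- Hypothetical function name; one boundary value gives two partitions.
CREATE PARTITION FUNCTION pfIncidentDate (date)
AS RANGE RIGHT FOR VALUES ('20100701');

-- RANGE RIGHT: the boundary value belongs to the partition on its right.
SELECT $PARTITION.pfIncidentDate('20100101') AS partition_for_jan,  -- 1
       $PARTITION.pfIncidentDate('20100701') AS partition_for_jul;  -- 2
```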

Now whenever you're dealing with partitioning you must consider if your NC indexes will be aligned (NC index is partitioned exactly the same as the clustered index) or non-aligned (NC index is non-partitioned, or partitioned differently from clustered index). Non-aligned indexes are more flexible, but they have some drawbacks:

non-aligned indexes require large amounts of memory for certain query plans
non-aligned indexes prevent efficient partition switch operations
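In DDL terms the difference comes down to the ON clause of the index. The following sketch uses hypothetical names (pfIncidentAlign, psIncidentAlign, dbo.IncidentDemo and its Status column) and keeps all partitions on PRIMARY so it can run without extra filegroups.

```sql
-- Hypothetical names throughout.
CREATE PARTITION FUNCTION pfIncidentAlign (date)
AS RANGE RIGHT FOR VALUES ('20100701');

CREATE PARTITION SCHEME psIncidentAlign
AS PARTITION pfIncidentAlign ALL TO ([PRIMARY]);

CREATE TABLE dbo.IncidentDemo (
    IncidentKey  char(6) NOT NULL,
    IncidentDate date    NOT NULL,
    Status       tinyint NOT NULL
) ON psIncidentAlign (IncidentDate);

CREATE CLUSTERED INDEX CIX_IncidentDemo
ON dbo.IncidentDemo (IncidentKey, IncidentDate)
ON psIncidentAlign (IncidentDate);

-- Aligned: partitioned with the same scheme and column as the table.
CREATE NONCLUSTERED INDEX IX_IncidentDemo_Status_Aligned
ON dbo.IncidentDemo (Status)
ON psIncidentAlign (IncidentDate);

-- Non-aligned: a single B-tree on one filegroup covering all partitions.
CREATE NONCLUSTERED INDEX IX_IncidentDemo_Status_NonAligned
ON dbo.IncidentDemo (Status)
ON [PRIMARY];
```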

Using aligned indexes solves these issues, but brings its own set of problems, because this physical, storage design, option ripples into the data model:

aligned indexes mean unique constraints can no longer be created/enforced (unless they include the partitioning column)
all foreign keys referencing the partitioned table must include the partitioning key in the relation (since the partitioning key is, due to alignment, in every index), and this in turn requires that all tables referencing the partitioned table contain the partitioning key column. Think Orders->OrderDetails: if Orders has OrderID but is partitioned by OrderDate, then OrderDetails must contain not only OrderID, but also OrderDate, in order to properly declare the foreign key constraint.
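The Orders->OrderDetails case can be sketched as follows. Table and column names beyond OrderID and OrderDate (for instance LineNumber), as well as pfOrders and psOrders, are assumptions made for the example.

```sql
-- Hypothetical partition function/scheme; LineNumber is an assumed column.
CREATE PARTITION FUNCTION pfOrders (date)
AS RANGE RIGHT FOR VALUES ('20100701');

CREATE PARTITION SCHEME psOrders
AS PARTITION pfOrders ALL TO ([PRIMARY]);

CREATE TABLE dbo.Orders (
    OrderID   int  NOT NULL,
    OrderDate date NOT NULL,
    -- Aligned primary key: it must include the partitioning column.
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID, OrderDate)
) ON psOrders (OrderDate);

CREATE TABLE dbo.OrderDetails (
    OrderID    int  NOT NULL,
    OrderDate  date NOT NULL,   -- carried only so the FK can be declared
    LineNumber int  NOT NULL,
    CONSTRAINT PK_OrderDetails
        PRIMARY KEY CLUSTERED (OrderID, OrderDate, LineNumber),
    -- The FK must match the full referenced key, OrderDate included.
    CONSTRAINT FK_OrderDetails_Orders
        FOREIGN KEY (OrderID, OrderDate)
        REFERENCES dbo.Orders (OrderID, OrderDate)
) ON psOrders (OrderDate);
```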

I have seldom found these effects called out at the beginning of a project that deploys partitioning, but they exist and have serious consequences.

If you think aligned indexes are a rare or extreme case, then consider this: in many cases the cornerstone of ETL and partitioning solutions is the fast switch in of staging tables. Switch in operations require aligned indexes.
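A minimal switch-in sketch, assuming hypothetical names (pfLoad, psLoad, dbo.IncidentLoad, dbo.IncidentLoad_Staging) and a single boundary at 7/1/2010. The staging table mirrors the target's structure and clustered index, sits on the same filegroup as the target partition, and carries a CHECK constraint proving its rows fit that partition.

```sql
-- Hypothetical names throughout.
CREATE PARTITION FUNCTION pfLoad (date)
AS RANGE RIGHT FOR VALUES ('20100701');

CREATE PARTITION SCHEME psLoad
AS PARTITION pfLoad ALL TO ([PRIMARY]);

CREATE TABLE dbo.IncidentLoad (
    IncidentKey  char(6) NOT NULL,
    IncidentDate date    NOT NULL,
    CONSTRAINT PK_IncidentLoad
        PRIMARY KEY CLUSTERED (IncidentKey, IncidentDate)
) ON psLoad (IncidentDate);

-- Staging table: same columns and clustered key, same filegroup as the
-- target partition, and a CHECK constraint matching partition 2's range.
CREATE TABLE dbo.IncidentLoad_Staging (
    IncidentKey  char(6) NOT NULL,
    IncidentDate date    NOT NULL,
    CONSTRAINT PK_IncidentLoad_Staging
        PRIMARY KEY CLUSTERED (IncidentKey, IncidentDate),
    CONSTRAINT CK_IncidentLoad_Staging_Date
        CHECK (IncidentDate >= '20100701')
) ON [PRIMARY];

INSERT INTO dbo.IncidentLoad_Staging (IncidentKey, IncidentDate)
VALUES ('ABC123', '20100701'), ('XYZ999', '20100701');

-- Metadata-only operation; the target partition must be empty.
ALTER TABLE dbo.IncidentLoad_Staging
SWITCH TO dbo.IncidentLoad PARTITION 2;
```

Adding a non-aligned index to dbo.IncidentLoad before the last statement would cause the switch to fail, which is the restriction described above.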

Oh, one more thing: all my argument about foreign keys and the ripple effect of adding the partitioning column value to other tables applies equally to joins.

Sql-server – Does dropping a Primary Key constraint from a Partitioned Table break the partition

It shouldn't affect the table's partitioning, as the table was created on the partition scheme with the partitioning column specified.

Each partition has its own HOBT (heap or B-tree). So essentially you have a HOBT for each Date in your table.

If the primary key is also the clustered index, you will be converting each partition's B-tree to a heap. Depending upon the size of the table, moving the data from the B-tree to the heap could take a considerable amount of time and resources. It's recommended to drop all nonclustered indexes before dropping the clustered index and to recreate them afterwards (or disable and rebuild them). Here's a detailed explanation from Microsoft:

"When a clustered index is dropped, the data rows that were stored in the leaf level of the clustered index are stored in an unordered table (heap). Dropping a clustered index can take time because in addition to dropping the clustered index, all nonclustered indexes on the table must be rebuilt to replace the clustered index keys with row pointers to the heap. When you drop all indexes on a table, drop the nonclustered indexes first and the clustered index last."

http://msdn.microsoft.com/en-us/library/ms190691%28v=SQL.90%29.aspx
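The disable/rebuild variant might look like the sketch below. All object names (pfDropDemo, psDropDemo, dbo.DropDemo, IX_DropDemo_Amount) are hypothetical, and the table is a stand-in for the real one.

```sql
-- Hypothetical setup standing in for the real partitioned table.
CREATE PARTITION FUNCTION pfDropDemo (date)
AS RANGE RIGHT FOR VALUES ('20100701');

CREATE PARTITION SCHEME psDropDemo
AS PARTITION pfDropDemo ALL TO ([PRIMARY]);

CREATE TABLE dbo.DropDemo (
    ID      int   NOT NULL,
    [Date]  date  NOT NULL,
    Amount  money NOT NULL,
    CONSTRAINT PK_DropDemo PRIMARY KEY CLUSTERED (ID, [Date])
) ON psDropDemo ([Date]);

CREATE NONCLUSTERED INDEX IX_DropDemo_Amount
ON dbo.DropDemo (Amount)
ON psDropDemo ([Date]);

-- 1. Disable nonclustered indexes so they are not rebuilt as a side
--    effect of dropping the clustered index.
ALTER INDEX IX_DropDemo_Amount ON dbo.DropDemo DISABLE;

-- 2. Drop the clustered primary key. Each partition becomes a heap;
--    the table stays on its partition scheme.
ALTER TABLE dbo.DropDemo DROP CONSTRAINT PK_DropDemo;

-- 3. Rebuild the nonclustered indexes once, against the heap.
ALTER INDEX IX_DropDemo_Amount ON dbo.DropDemo REBUILD;

-- Check: the heap (index_id = 0) should still list one row per partition.
SELECT index_id, partition_number, rows
FROM sys.partitions
WHERE object_id = OBJECT_ID(N'dbo.DropDemo')
ORDER BY index_id, partition_number;
```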

I won't comment on query performance implications of doing this, as I'm not familiar with how the table is queried.

As with any major change, make sure to try this in a test environment before making the change in production.
