Sql-server – Difference between clustered index and partitioning column

partitioning, sql-server, sql-server-2012

As I understand partitioning, the column that you partition on is how things are divided into partitions.

But does the partitioning column have anything to do with order on the disk? (Aside from which partition the data goes into.)

Say I have a table that looks like this:

CREATE TABLE [dbo].[Something](
    [SomethingId] [bigint] IDENTITY(1,1) NOT NULL,
    [OtherThingId] [bigint] NOT NULL,
    [CreatedBy] [int] NOT NULL,
    [CreatedWhen] [datetime] NOT NULL,
    [CreatedWhere] [varchar](255) NOT NULL,
     CONSTRAINT [PK_Something] PRIMARY KEY CLUSTERED 
    (
        [SomethingId] DESC,
        [OtherThingId] DESC
    )
) ON OtherThingIdPartitionScheme(OtherThingId)

The partition that a new row will be stored on is decided by OtherThingId.

But I set up the clustered index to be ordered first by SomethingId, then by OtherThingId.

Does that mean that each partition will be ordered by SomethingId, then by OtherThingId?
(Even though it is partitioned only by the secondary value of OtherThingId.)

(I am setting up a lot of partitioning right now and I want to be sure I fully understand it.)

Best Answer

But does the partitioning column have anything to do with order on the disk?

From Clustered Index Structures:

Clustered indexes have one row in sys.partitions, with index_id = 1 for each partition used by the index. By default, a clustered index has a single partition. When a clustered index has multiple partitions, each partition has a B-tree structure that contains the data for that specific partition. For example, if a clustered index has four partitions, there are four B-tree structures; one in each partition.
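To see the "one row per partition" statement for yourself, a query along these lines should work. It is only a sketch: it assumes the dbo.Something table from the question already exists and is partitioned.

```sql
-- One row per partition of the clustered index (index_id = 1) on dbo.Something.
SELECT p.partition_number,
       p.hobt_id,   -- identifies the separate B-tree that belongs to this partition
       p.rows       -- approximate row count stored in this partition
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID(N'dbo.Something')
  AND p.index_id = 1
ORDER BY p.partition_number;
```

Each partition_number comes back with its own hobt_id, which matches the description of one B-tree structure per partition.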

From Table and Index Organization:

When a table or index uses multiple partitions, the data is partitioned horizontally so that groups of rows are mapped into individual partitions, based on a specified column. The partitions can be put on one or more filegroups in the database. The table or index is treated as a single logical entity when queries or updates are performed on the data.

The pages in the data chain and the rows in them are ordered on the value of the clustered index key. All inserts are made at the point where the key value in the inserted row fits in the ordering sequence among existing rows. The page collections for the B-tree are anchored by page pointers in the sys.system_internals_allocation_units system view.
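Applied to the table in the question, this means the partitioning column only picks the partition, and the clustered key orders the rows inside it. A minimal sketch follows. The function name OtherThingIdPartitionFunction, the boundary values, and the single-filegroup mapping are all assumptions made for illustration; only the scheme name comes from the question.

```sql
-- Hypothetical partition function: boundary values are made up for illustration.
-- RANGE LEFT means each boundary value belongs to the partition on its left.
CREATE PARTITION FUNCTION OtherThingIdPartitionFunction (bigint)
AS RANGE LEFT FOR VALUES (1000, 2000, 3000);
GO

-- The scheme named in the question; mapping every partition to PRIMARY is an assumption.
CREATE PARTITION SCHEME OtherThingIdPartitionScheme
AS PARTITION OtherThingIdPartitionFunction ALL TO ([PRIMARY]);
GO

-- (Create dbo.Something from the question at this point.)

-- Which partition would a given OtherThingId land in?
-- 1500 is greater than 1000 and not greater than 2000, so this returns 2.
SELECT $PARTITION.OtherThingIdPartitionFunction(1500) AS PartitionNumber;

-- The logical order of the stored rows: partition first,
-- then the clustered key (SomethingId DESC, OtherThingId DESC) within each partition.
SELECT $PARTITION.OtherThingIdPartitionFunction(s.OtherThingId) AS PartitionNumber,
       s.SomethingId,
       s.OtherThingId
FROM dbo.Something AS s
ORDER BY PartitionNumber, s.SomethingId DESC, s.OtherThingId DESC;
```

The ORDER BY in the last query only spells out the ordering described above; it is not a way to inspect physical page placement.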

If you really want to dig into how data is laid out, refer to: Inside The Storage Engine: sp_AllocationMetadata

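As a side note: SQL Server 2016 CTP2 adds TRUNCATE TABLE ... [ WITH ( PARTITIONS ( { <partition_number_expression> | <range> } [ , ...n ] ) ) ], which truncates only the specified partitions.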
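For example, on SQL Server 2016 or later, partition-level truncation of the table from the question could look like the sketch below. The partition numbers are assumptions; they presume the table has at least four partitions and that all of its indexes are aligned with it.

```sql
-- Requires SQL Server 2016 or later.
-- Removes all rows from partition 1 and from partitions 3 through 4,
-- leaving every other partition untouched.
TRUNCATE TABLE dbo.Something
WITH (PARTITIONS (1, 3 TO 4));
```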

For automating partition switching, check out the SQL Server Partition Management utility.

Related Solutions

Sql-server – SQL Server 2008 – Partitioning and Clustered Indexes

A partitioned table is really more like a collection of individual tables stitched together. So in your example of clustering by IncidentKey and partitioning by IncidentDate, say that the partitioning function splits the table into two partitions so that 1/1/2010 is in partition 1 and 7/1/2010 is in partition 2. The data will be laid out on disk as:

Partition 1:
IncidentKey    Date
ABC123        1/1/2010
ABC123        1/1/2011
XYZ999        1/1/2010

Partition 2:
IncidentKey    Date
ABC123        7/1/2010
XYZ999        7/1/2010

At a low level there really are two distinct rowsets. It is the query processor that gives the illusion of a single table by creating plans that seek, scan and update all rowsets together, as one.

Any row in any non-clustered index will have the clustered index key to which it corresponds, say ABC123,7/1/2010. Since the clustered index key always contains the partitioning key column, the engine will always know in what partition (rowset) of the clustered index to search for this value (in this case, in partition 2).

Now whenever you're dealing with partitioning you must consider if your NC indexes will be aligned (NC index is partitioned exactly the same as the clustered index) or non-aligned (NC index is non-partitioned, or partitioned differently from clustered index). Non-aligned indexes are more flexible, but they have some drawbacks:

non-aligned indexes require large amounts of memory for certain query plans
non-aligned indexes prevent efficient partition switch operations
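The difference between the two kinds of index comes down to the ON clause of CREATE INDEX. Here is a rough sketch using the IncidentKey/IncidentDate example; the table name dbo.Incident, the Status column, and the IncidentDatePF/IncidentDatePS names are all hypothetical.

```sql
-- Hypothetical function and scheme: dates before 2010-07-01 go to partition 1,
-- later dates go to partition 2.
CREATE PARTITION FUNCTION IncidentDatePF (date)
AS RANGE RIGHT FOR VALUES ('2010-07-01');
GO
CREATE PARTITION SCHEME IncidentDatePS
AS PARTITION IncidentDatePF ALL TO ([PRIMARY]);
GO

-- Clustered by IncidentKey, partitioned by IncidentDate.
CREATE TABLE dbo.Incident (
    IncidentKey  char(6)     NOT NULL,
    IncidentDate date        NOT NULL,
    Status       varchar(20) NOT NULL,
    CONSTRAINT PK_Incident PRIMARY KEY CLUSTERED (IncidentKey, IncidentDate)
) ON IncidentDatePS (IncidentDate);
GO

-- Aligned: placed on the same partition scheme and column as the table.
CREATE NONCLUSTERED INDEX IX_Incident_Status_Aligned
    ON dbo.Incident (Status)
    ON IncidentDatePS (IncidentDate);

-- Non-aligned: stored as a single unpartitioned structure on a filegroup.
CREATE NONCLUSTERED INDEX IX_Incident_Status_NonAligned
    ON dbo.Incident (Status)
    ON [PRIMARY];
```

If the ON clause is omitted on a partitioned table, the index is created on the table's partition scheme, so it is aligned by default.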

Using aligned indexes solves these issues, but brings its own set of problems, because this physical, storage design, option ripples into the data model:

aligned indexes mean unique constraints can no longer be created or enforced unless they include the partitioning column
all foreign keys referencing the partitioned table must include the partitioning key in the relation (since the partitioning key is, due to alignment, in every index), and this in turn requires that all tables referencing the partitioned table contain the partitioning key column value. Think Orders->OrderDetails: if Orders has OrderID but is partitioned by OrderDate, then OrderDetails must contain not only OrderID but also OrderDate in order to properly declare the foreign key constraint.
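The Orders->OrderDetails ripple effect can be sketched as follows. Column types, constraint names, and the OrderDatePF/OrderDatePS objects are assumptions made for illustration.

```sql
-- Hypothetical partition function and scheme on the order date.
CREATE PARTITION FUNCTION OrderDatePF (date)
AS RANGE RIGHT FOR VALUES ('2010-07-01');
GO
CREATE PARTITION SCHEME OrderDatePS
AS PARTITION OrderDatePF ALL TO ([PRIMARY]);
GO

-- To stay aligned, the primary key has to include the partitioning column,
-- so uniqueness is enforced on (OrderID, OrderDate) rather than on OrderID alone.
CREATE TABLE dbo.Orders (
    OrderID   int  NOT NULL,
    OrderDate date NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID, OrderDate)
) ON OrderDatePS (OrderDate);
GO

-- The child table must carry OrderDate as well, purely so that the
-- foreign key can reference the two-column key of dbo.Orders.
CREATE TABLE dbo.OrderDetails (
    OrderDetailID int  NOT NULL,
    OrderID       int  NOT NULL,
    OrderDate     date NOT NULL,
    CONSTRAINT PK_OrderDetails PRIMARY KEY CLUSTERED (OrderDetailID),
    CONSTRAINT FK_OrderDetails_Orders
        FOREIGN KEY (OrderID, OrderDate)
        REFERENCES dbo.Orders (OrderID, OrderDate)
);
```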

I have found that these effects are seldom called out at the beginning of a project that deploys partitioning, but they exist and have serious consequences.

If you think aligned indexes are a rare or extreme case, then consider this: in many cases the cornerstone of ETL and partitioning solutions is the fast switch in of staging tables. Switch in operations require aligned indexes.
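A switch-in of a staging table might look roughly like the sketch below. Every object name here is hypothetical, and the details (same filegroup, matching clustered key, a check constraint proving the rows fit the target partition, an empty target partition) are the usual prerequisites rather than anything specific to this question.

```sql
-- Hypothetical partitioned target: partition 2 holds dates on or after 2010-07-01.
CREATE PARTITION FUNCTION EventDatePF (date)
AS RANGE RIGHT FOR VALUES ('2010-07-01');
GO
CREATE PARTITION SCHEME EventDatePS
AS PARTITION EventDatePF ALL TO ([PRIMARY]);
GO
CREATE TABLE dbo.EventLog (
    EventId   int  NOT NULL,
    EventDate date NOT NULL,
    CONSTRAINT PK_EventLog PRIMARY KEY CLUSTERED (EventId, EventDate)
) ON EventDatePS (EventDate);
GO

-- Staging table: same columns and clustered key, on the same filegroup as the
-- target partition, with a check constraint matching that partition's range.
CREATE TABLE dbo.EventLog_Staging (
    EventId   int  NOT NULL,
    EventDate date NOT NULL,
    CONSTRAINT PK_EventLog_Staging PRIMARY KEY CLUSTERED (EventId, EventDate),
    CONSTRAINT CK_EventLog_Staging_Range CHECK (EventDate >= '2010-07-01')
) ON [PRIMARY];
GO

-- (Bulk load dbo.EventLog_Staging here.)

-- Metadata-only move of the staged rows into partition 2, which must be empty.
ALTER TABLE dbo.EventLog_Staging
SWITCH TO dbo.EventLog PARTITION 2;
```

The only index on the target here is its clustered primary key, which is aligned by construction. Adding a non-aligned index to dbo.EventLog would make the SWITCH statement fail.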

Oh, one more thing: all my argument about foreign keys and the ripple effect of adding the partitioning column value to other tables applies equally to joins.

Sql-server – Partition Function vs. Clustered Index

Partitioning a table only divides it into "chunks" based on the partition function. The clustered index will give order to the data within each partition.

If you're planning to run queries that involve parts of a partition (e.g., show me sales between Jan 5th and Jan 12th), then it can be advantageous to those queries to have the date as the leading column of the clustering key. This type of structure will result in clustered index seeks, instead of partition scans. (Assuming there are no other suitable indexes on the table.)

If the queries only touch entire partitions, it doesn't matter as much, as partition elimination is enough to isolate the needed data. That said, ordering by date first could eliminate the need for an expensive sort operation, depending on what you're doing.

But this also depends on which columns you need from the table. If you only need to do something like aggregate total sales amounts over a date range, it may be sufficient to use a covering nonclustered index with the date as the leading (or only) column of the key, instead of rebuilding the entire table just for that.
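A covering index of that kind could be sketched as follows. The dbo.Sales table, its columns, and the date range are assumptions, and partitioning is left out to keep the example short.

```sql
-- Hypothetical sales table clustered on its identity column.
CREATE TABLE dbo.Sales (
    SaleId   bigint IDENTITY(1,1) NOT NULL,
    SaleDate date                 NOT NULL,
    Amount   decimal(19,4)        NOT NULL,
    CONSTRAINT PK_Sales PRIMARY KEY CLUSTERED (SaleId)
);
GO

-- Date as the only key column; Amount is carried at the leaf level
-- so the aggregate below never has to touch the clustered index.
CREATE NONCLUSTERED INDEX IX_Sales_SaleDate
    ON dbo.Sales (SaleDate)
    INCLUDE (Amount);
GO

-- Total sales from Jan 5th through Jan 12th (half-open range on the date).
SELECT SUM(s.Amount) AS TotalSales
FROM dbo.Sales AS s
WHERE s.SaleDate >= '2015-01-05'
  AND s.SaleDate <  '2015-01-13';
```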

If you do change the clustered index, this will affect singleton lookups (probably by the identity column which I assume is the primary key) as they will now involve a nonclustered index seek + key lookup. If this type of activity isn't a major part of the workload, this will be fine, but you have to be really careful that not too many rows are selected by these queries, or the optimizer will revert to a partition scan on the assumption that it's cheaper. Again, depending on which columns you need, it may be advantageous to create a covering nonclustered index that includes only the columns you need.
