A partitioned table is really more like a collection of individual tables stitched together. So in your example of clustering by IncidentKey and partitioning by IncidentDate, say that the partitioning function splits the table into two partitions so that 1/1/2010 is in partition 1 and 7/1/2010 is in partition 2. The data will be laid out on disk as:
Partition 1:
IncidentKey Date
ABC123 1/1/2010
ABC123 1/1/2011
XYZ999 1/1/2010
Partition 2:
IncidentKey Date
ABC123 7/1/2010
XYZ999 7/1/2010
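For illustration only, a two-partition setup along these lines might be declared as follows. All object names and the boundary value are assumptions, not from the question, and which partition an individual row lands in depends on the actual partition function:

```sql
-- Hypothetical partition function/scheme: one boundary value yields two partitions.
CREATE PARTITION FUNCTION pfIncident (date)
    AS RANGE RIGHT FOR VALUES ('2010-07-01');

CREATE PARTITION SCHEME psIncident
    AS PARTITION pfIncident ALL TO ([PRIMARY]);

CREATE TABLE dbo.Incidents (
    IncidentKey  char(6) NOT NULL,
    IncidentDate date    NOT NULL,
    -- Clustered key must include the partitioning column (see below).
    CONSTRAINT PK_Incidents
        PRIMARY KEY CLUSTERED (IncidentKey, IncidentDate)
) ON psIncident (IncidentDate);
```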
At a low level there really are two distinct rowsets. It is the query processor that gives the illusion of a single table, by creating plans that seek, scan and update all rowsets together, as one.
Any row in any non-clustered index will have the clustered index key to which it corresponds, say ABC123, 7/1/2010. Since the clustered index key always contains the partitioning key column, the engine will always know in which partition (rowset) of the clustered index to search for this value (in this case, partition 2).
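You can see this mapping directly with the `$PARTITION` function. This sketch assumes a partition function named `pfIncident` over the date column with a 7/1/2010 boundary; both are illustrative:

```sql
-- Ask the (hypothetical) partition function which partition a given date
-- maps to; the engine does the same when it follows a non-clustered index
-- row's clustered key back into the partitioned clustered index.
SELECT $PARTITION.pfIncident('2010-07-01') AS partition_number;
```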
Now whenever you're dealing with partitioning you must consider whether your NC indexes will be aligned (the NC index is partitioned exactly the same as the clustered index) or non-aligned (the NC index is non-partitioned, or partitioned differently from the clustered index). Non-aligned indexes are more flexible, but they have some drawbacks:
- a table with a non-aligned index cannot be the source or target of a partition switch; the non-aligned index must be disabled or dropped first
- partition-level maintenance on the base table (such as a sliding-window purge) cannot be a metadata-only operation while a non-aligned index exists
Using aligned indexes solves these issues, but brings its own set of problems, because this physical storage design option ripples into the data model:
- aligned indexes mean unique constraints can no longer be created/enforced, unless the constraint includes the partitioning column
- all foreign keys referencing the partitioned table must include the partitioning key in the relation (since the partitioning key is, due to alignment, in every index), and this in turn requires that all tables referencing the partitioned table contain the partitioning key column. Think Orders -> OrderDetails: if Orders is keyed by OrderID but partitioned by OrderDate, then OrderDetails must contain not only OrderID but also OrderDate in order to properly declare the foreign key constraint.
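A minimal sketch of the Orders -> OrderDetails ripple (table, column and constraint names are assumptions for illustration):

```sql
-- Orders is logically keyed by OrderID, but because it is partitioned
-- by OrderDate, the enforceable unique key must include OrderDate.
CREATE TABLE dbo.Orders (
    OrderID   int  NOT NULL,
    OrderDate date NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY (OrderID, OrderDate)
);

CREATE TABLE dbo.OrderDetails (
    OrderDetailID int  NOT NULL PRIMARY KEY,
    OrderID       int  NOT NULL,
    OrderDate     date NOT NULL,  -- carried here only so the FK can be declared
    CONSTRAINT FK_OrderDetails_Orders
        FOREIGN KEY (OrderID, OrderDate)
        REFERENCES dbo.Orders (OrderID, OrderDate)
);
```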
I have seldom seen these effects called out at the beginning of a project that deploys partitioning, but they exist and have serious consequences.
If you think aligned indexes are a rare or extreme case, then consider this: in many cases the cornerstone of ETL and partitioning solutions is the fast switch-in of staging tables. Switch-in operations require aligned indexes.
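A typical switch-in looks like this (object names and the partition number are assumptions); it fails if any index on the target table is not aligned:

```sql
-- Metadata-only operation: the staging table's structure, indexes and
-- check constraints must match the target partition exactly.
ALTER TABLE dbo.FactSales_Stage
    SWITCH TO dbo.FactSales PARTITION 24;
```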
Oh, one more thing: everything I said about foreign keys and the ripple effect of adding the partitioning column to other tables applies equally to joins.
We have a similar situation to the one you described, except that these copied tables are only used for ETL (by select queries). I'll describe how we approached it, but you have to evaluate whether it works for your environment (see Jon Seigel's comment on your question).
The concept is straightforward:
- We index these "copy" tables in the staging area to support queries as much as possible. Most of the indexes (especially on the large tables, which get hit by scans) are columnstore indexes (we're on 2012 as well); some are covering nonclustered indexes. As I said, these tables are (for the time being) only used for select queries in the ETL, so we can make sure every query is supported either by a columnstore index or a covering nonclustered index.
- Before the data pump (copy) from the source systems into the staging area, we drop the indexes, to make the bulk loads as fast as possible and to ensure we're touching the source systems for the minimal amount of time. After the data pump is complete, we recreate them. The logic is set up so that the recreation is done "dynamically" (indexes are not hardcoded - every index that gets dropped is then recreated exactly as it was), which allows us to change indexing without having to maintain the drop/create code.
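In outline, one load cycle looks like this. Table, index and path names are illustrative; in a real implementation the drop/create statements would be generated from `sys.indexes` rather than hardcoded, as described above:

```sql
-- 1. Drop the index before the load (on 2012, a nonclustered columnstore
--    index makes the table read-only anyway, so it has to go first).
DROP INDEX NCCI_StageOrders ON stage.Orders;

-- 2. Bulk load with minimal logging where possible.
BULK INSERT stage.Orders
FROM '\\etl\extracts\orders.dat'
WITH (TABLOCK);

-- 3. Recreate the index exactly as it was.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_StageOrders
    ON stage.Orders (OrderID, OrderDate, Amount);
```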
This approach works for us; drop/recreate is faster than loading into indexed tables. In the case of a columnstore index you don't have many other options anyway, apart from partitioning and using partition switch, but we're not doing that for these tables (EDW fact tables are a different story).
Best Answer
Pretty much every implementation decision is a trade-off between competing factors. Building a columnstore index is CPU intensive, but afterwards queries touching many rows are fast, while updates are slow. Which is most important for your workload, on average? Is there a time window in which that amount of CPU can be consumed without breaking other parts of the system? Is the additional cost repaid in future benefit? What is the problem to which a columnstore is the solution, and has it been addressed in other aspects of the system?
You mention ETL. Often such tables are only processed as scans where every row is touched by every operation. In such cases indexes will slow processing because they must be written in addition to the table.
Likely there are reasons for the current restrictions. Understand why they were put in place. If those circumstances no longer hold (maybe a server version upgrade or better hardware now) try an experiment in a test environment.