Sql-server – What’s a good indexing scheme for a type-2 SCD table with a very broad natural key

index sql-server-2008

I have a table which links several fairly long text fields (~100 chars) to a single text field: { A, B, C, D } => E. The dependent value can change, so I query data daily and record A, B, C, D, EffectiveDate, E. My design right now has a clustered PK on { A, B, C, D, EffectiveDate }, in that order. I had thought that I could then query the most recent value easily:

SELECT
    X.A, X.B, X.C, X.D, X.E
FROM
    Tbl AS X
WHERE
    NOT EXISTS (SELECT * FROM Tbl AS X2 WHERE X.A = X2.A AND X.B = X2.B AND X.C = X2.C AND X.D = X2.D AND X2.EffectiveDate > X.EffectiveDate)

Since the self-join is on the first fields in the clustered key, I expected good performance, with a merge join. However, it's taking about two minutes with less than a million rows in the table. This is part of a near-real-time update, so I really need it to take 30 seconds, tops.

Is there a better indexing strategy for this scenario? I have a couple of ideas which I have not tried, because of some combination of (A) I'm lazy and (B) I'd rather get the right answer from the wise people on Stack Exchange, so I'll know better in the future. I realize that the "best" solution will depend on exactly the data I have, but I suspect that there's a good general solution I'm missing and that I just need to adjust my clustering somehow.

I could replace the wide PK with a synthetic integer key but keep the wide key for clustering, but my understanding is that this would only reduce the weight of any additional indices, of which this table has none.
I ~~could~~ should stop recording data every day, and instead only record changes. Values for E change every week or two on average, so I'm definitely bloating the data around tenfold.
I could create a synthetic key to collapse these text values to integers, or maybe use a hash to join first and then resolve collisions with the exact values.
I could shift to a type 4 SCD and do all the heavy lifting during overnight ETL.

Best Answer

Look at the order of the columns in the index and ensure the most selective come first. Counting the distinct values per column gives a quick measure of selectivity:

SELECT
 COUNT(DISTINCT A),
 COUNT(DISTINCT B),
 COUNT(DISTINCT C),
 COUNT(DISTINCT D)
FROM
 Tbl

Then, read this about memory allocations for varchar columns in queries: SQL Server VARCHAR Column Width (follow Martin Smith's link)

Finally, you could use HASHBYTES to collapse the columns into a hashed computed column. You can add this to the index as the first column, which makes the index highly selective because the hash is almost unique. A surrogate key may also be useful.
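
A minimal sketch of the hashed computed column, assuming the question's table is dbo.Tbl and that A through D are non-nullable varchar columns. The names KeyHash and IX_Tbl_KeyHash, the '|' delimiter and the choice of SHA1 are illustrative assumptions, not something the answer specifies. On SQL Server 2008, HASHBYTES accepts at most 8000 bytes of input, which four fields of about 100 characters stay well under.

```sql
-- Hypothetical names: dbo.Tbl, KeyHash, IX_Tbl_KeyHash.
-- GO is the SSMS/sqlcmd batch separator, used so each step sees the previous one.

-- The delimiter keeps ('ab', 'c') and ('a', 'bc') from producing the same input string.
ALTER TABLE dbo.Tbl
    ADD KeyHash AS CAST(HASHBYTES('SHA1', A + '|' + B + '|' + C + '|' + D) AS binary(20)) PERSISTED;
GO

-- Narrow leading column; EffectiveDate orders the versions of each key.
CREATE NONCLUSTERED INDEX IX_Tbl_KeyHash
    ON dbo.Tbl (KeyHash, EffectiveDate);
GO

-- Latest row per key: match on the hash first, then compare the real columns
-- so that a hash collision cannot produce a wrong match.
SELECT X.A, X.B, X.C, X.D, X.E
FROM dbo.Tbl AS X
WHERE NOT EXISTS (
    SELECT *
    FROM dbo.Tbl AS X2
    WHERE X2.KeyHash = X.KeyHash
      AND X2.A = X.A AND X2.B = X.B AND X2.C = X.C AND X2.D = X.D
      AND X2.EffectiveDate > X.EffectiveDate
);
```

Whether the optimizer actually prefers the hash index depends on the data and statistics, so compare the execution plans before and after.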

Related Solutions

Sql-server – Composite or Single-Field Clustering Key

This is bollocks:

... ClaimId alone is a better clustered key because it is narrower

because of this

ClaimId alone is NOT unique

A non-unique clustered index will add a 4 byte uniquifier to remove the ambiguity of ClaimId, because it is the clustered index. Why? One reason is that all NC indexes refer to it: otherwise, how would they know which ClaimId row is which?
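
To make the two choices concrete, here is a hedged sketch. The table dbo.Claim, the LineNumber column and the index names are hypothetical, since the schema is not shown here. The two indexes are alternatives, because a table can have only one clustered index.

```sql
-- Hypothetical schema: ClaimId repeats, (ClaimId, LineNumber) is unique.
CREATE TABLE dbo.Claim (
    ClaimId    int   NOT NULL,
    LineNumber int   NOT NULL,
    Amount     money NOT NULL
);

-- Option 1: non-unique clustered index on ClaimId alone.
-- Rows that duplicate an existing ClaimId get a hidden 4 byte uniquifier,
-- and nonclustered indexes carry ClaimId plus that uniquifier as the row locator.
CREATE CLUSTERED INDEX CIX_Claim_ClaimId
    ON dbo.Claim (ClaimId);

-- Option 2: drop option 1 and declare the composite key as unique instead.
DROP INDEX CIX_Claim_ClaimId ON dbo.Claim;

CREATE UNIQUE CLUSTERED INDEX CIX_Claim_ClaimId_LineNumber
    ON dbo.Claim (ClaimId, LineNumber);
```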

It was demonstrated (some time ago, maybe not valid now and can't find it) that non-unique clustered indexes break when you exhaust 2^32 values of the 4 byte uniquifier

Edit :

The question states that ClaimId is not unique, so I assumed that the uniquifier exists. No need to comment that it may not exist in the context of the question.

SQL Server Partitioning – Partitioning and Clustered Indexes in SQL Server 2008

A partitioned table is really more like a collection of individual tables stitched together. So in your example of clustering by IncidentKey and partitioning by IncidentDate, say that the partitioning function splits the table into two partitions so that 1/1/2010 is in partition 1 and 7/1/2010 is in partition 2. The data will be laid out on disk as:

Partition 1:
IncidentKey    Date
ABC123        1/1/2010
ABC123        1/1/2011
XYZ999        1/1/2010

Partition 2:
IncidentKey    Date
ABC123        7/1/2010
XYZ999        7/1/2010

At a low level there really are two distinct rowsets. It is the query processor that gives the illusion of a single table by creating plans that seek, scan and update all rowsets together, as one.

Any row in any non-clustered index will have the clustered index key to which it corresponds, say ABC123,7/1/2010. Since the clustered index key always contains the partitioning key column, the engine will always know in what partition (rowset) of the clustered index to search for this value (in this case, in partition 2).
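
As a sketch of what such a setup could look like, assuming a single boundary at 2010-07-01 and a date-typed partitioning column. Every object name here (pfIncidentDate, psIncidentDate, dbo.Incident, Status and the index names) is hypothetical.

```sql
-- One boundary value gives two partitions:
-- dates before 2010-07-01, and dates from 2010-07-01 onward.
CREATE PARTITION FUNCTION pfIncidentDate (date)
    AS RANGE RIGHT FOR VALUES ('20100701');

CREATE PARTITION SCHEME psIncidentDate
    AS PARTITION pfIncidentDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.Incident (
    IncidentKey  char(6)     NOT NULL,
    IncidentDate date        NOT NULL,
    Status       varchar(20) NOT NULL
) ON psIncidentDate (IncidentDate);

-- Clustered by IncidentKey; the partitioning column is part of the key.
CREATE UNIQUE CLUSTERED INDEX CIX_Incident
    ON dbo.Incident (IncidentKey, IncidentDate)
    ON psIncidentDate (IncidentDate);

-- Aligned nonclustered index: same partition scheme and column as the clustered index.
CREATE NONCLUSTERED INDEX IX_Incident_Status
    ON dbo.Incident (Status)
    ON psIncidentDate (IncidentDate);

-- Ask the partition function where a given date lands.
SELECT $PARTITION.pfIncidentDate(CAST('20100101' AS date)) AS PartitionForJan,  -- 1
       $PARTITION.pfIncidentDate(CAST('20100701' AS date)) AS PartitionForJul;  -- 2
```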

Now whenever you're dealing with partitioning you must consider if your NC indexes will be aligned (NC index is partitioned exactly the same as the clustered index) or non-aligned (NC index is non-partitioned, or partitioned differently from clustered index). Non-aligned indexes are more flexible, but they have some drawbacks:

non-aligned indexes require large amounts of memory for certain query plans
non-aligned indexes prevent efficient partition switch operations

Using aligned indexes solves these issues, but brings its own set of problems, because this physical storage design option ripples into the data model:

aligned indexes mean unique constraints can no longer be created/enforced unless they include the partitioning column
all foreign keys referencing the partitioned table must include the partitioning key in the relation (since the partitioning key is, due to alignment, in every index), and this in turn requires that all tables referencing the partitioned table contain the partitioning key column value. Think Orders->OrderDetails: if Orders has OrderID but is partitioned by OrderDate, then OrderDetails must contain not only OrderID, but also OrderDate, in order to properly declare the foreign key constraint.
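
The Orders->OrderDetails case can be sketched as follows. This is an illustration under assumed names (pfOrderDate, psOrderDate, LineNumber, Quantity and the constraint names are all hypothetical), not a schema taken from the question.

```sql
CREATE PARTITION FUNCTION pfOrderDate (date)
    AS RANGE RIGHT FOR VALUES ('20100701');

CREATE PARTITION SCHEME psOrderDate
    AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.Orders (
    OrderID   int  NOT NULL,
    OrderDate date NOT NULL,
    -- An aligned unique index must contain the partitioning column,
    -- so the key is (OrderID, OrderDate) rather than OrderID alone.
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID, OrderDate)
        ON psOrderDate (OrderDate)
);

CREATE TABLE dbo.OrderDetails (
    OrderID    int  NOT NULL,
    OrderDate  date NOT NULL,  -- carried only so the foreign key can be declared
    LineNumber int  NOT NULL,
    Quantity   int  NOT NULL,
    CONSTRAINT PK_OrderDetails PRIMARY KEY (OrderID, LineNumber),
    CONSTRAINT FK_OrderDetails_Orders
        FOREIGN KEY (OrderID, OrderDate)
        REFERENCES dbo.Orders (OrderID, OrderDate)
);
```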

I have found that these effects are seldom called out at the beginning of a project that deploys partitioning, but they exist and have serious consequences.

If you think aligned indexes are a rare or extreme case, then consider this: in many cases the cornerstone of ETL and partitioning solutions is the fast switch in of staging tables. Switch in operations require aligned indexes.

Oh, one more thing: my whole argument about foreign keys and the ripple effect of adding the partitioning column value to other tables applies equally to joins.
