SQL Server – What’s a good indexing scheme for a type-2 SCD table with a very broad natural key

index sql-server-2008

I have a table which links several fairly long text fields (~100 chars) to a single text field: { A, B, C, D } => E. The dependent value can change, so I query data daily and record A, B, C, D, EffectiveDate, E. My design right now has a clustered PK on { A, B, C, D, EffectiveDate }, in that order. I had thought that I could then query the most recent value easily:

SELECT
    X.A, X.B, X.C, X.D, X.E
FROM
    Tbl AS X
WHERE
    NOT EXISTS (
        SELECT *
        FROM Tbl AS X2
        WHERE X.A = X2.A AND X.B = X2.B AND X.C = X2.C AND X.D = X2.D
            AND X2.EffectiveDate > X.EffectiveDate
    )

Since the self-join is on the leading fields of the clustered key, I expected good performance, with a merge join. However, it's taking about two minutes with fewer than a million rows in the table. This is part of a near-real-time update, so I really need it to take 30 seconds, tops.

Is there a better indexing strategy for this scenario? I have a couple of ideas which I have not tried, because of some combination of (A) I'm lazy and (B) I'd rather get the right answer from the wise people on Stack Exchange, so I'll know better in the future. I realize that the "best" solution will depend on exactly the data I have, but I suspect that there's a good general solution I'm missing and that I just need to adjust my clustering somehow.

  • I could replace the wide PK with a synthetic integer key but keep the wide key for clustering, but my understanding is that this would only reduce the weight of any additional indices, of which this table has none.
  • I could stop recording data every day, and instead only record changes. Values for E change every week or two on average, so I'm definitely bloating the data around tenfold.
  • I could create a synthetic key to collapse these text values to integers, or maybe use a hash to join first and then resolve collisions with the exact values.
  • I could shift to a type 4 SCD and do all the heavy lifting during overnight ETL.
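
For illustration, the second idea (recording only changes) might look roughly like the sketch below. It assumes a hypothetical staging table, Staging, that holds today's snapshot with columns A, B, C, D and E, that E is never NULL, and that EffectiveDate is a DATE column; none of this is stated above, so treat it as a sketch rather than a tested load routine.

```sql
-- Sketch only: Staging is a hypothetical table holding today's snapshot.
-- Assumes E is NOT NULL and EffectiveDate is a DATE column.
INSERT INTO Tbl (A, B, C, D, EffectiveDate, E)
SELECT S.A, S.B, S.C, S.D, CAST(GETDATE() AS DATE), S.E
FROM Staging AS S
OUTER APPLY (
    -- Most recent recorded value for this natural key, if any.
    SELECT TOP (1) T.E
    FROM Tbl AS T
    WHERE T.A = S.A AND T.B = S.B AND T.C = S.C AND T.D = S.D
    ORDER BY T.EffectiveDate DESC
) AS Latest
WHERE Latest.E IS NULL      -- key has never been recorded
   OR Latest.E <> S.E;      -- value differs from the last recorded one
```

With E changing every week or two, a load like this would write a row only on those changes instead of one row per key per day.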

Best Answer

Look at the order of the columns in the index and make sure the most selective ones come first:

SELECT
    COUNT(DISTINCT A) AS DistinctA,
    COUNT(DISTINCT B) AS DistinctB,
    COUNT(DISTINCT C) AS DistinctC,
    COUNT(DISTINCT D) AS DistinctD
FROM
    Tbl

Then, read this about memory allocations for varchar columns in queries: SQL Server VARCHAR Column Width (follow Martin Smith's link).

Finally, you could use HASHBYTES to collapse the columns into a hashed computed column. You can also add this to the index as the first column: it is highly selective because it is almost unique. A surrogate key may also be useful.
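
A minimal sketch of the HASHBYTES idea, assuming SQL Server 2008 (so SHA1 rather than the later SHA2 algorithms) and the Tbl columns from the question. The names KeyHash and IX_Tbl_KeyHash are made up for illustration. It also assumes the key values are stored with consistent casing, because HASHBYTES works on bytes and would hash 'abc' and 'ABC' differently even under a case-insensitive collation.

```sql
-- Hypothetical names: KeyHash, IX_Tbl_KeyHash.
-- The '|' delimiter keeps ('ab', 'c') and ('a', 'bc') from hashing alike.
ALTER TABLE Tbl
ADD KeyHash AS CAST(
        HASHBYTES('SHA1', A + '|' + B + '|' + C + '|' + D) AS BINARY(20)
    ) PERSISTED;

-- Hash first, then date, so the rows for one key sit together.
CREATE NONCLUSTERED INDEX IX_Tbl_KeyHash
ON Tbl (KeyHash, EffectiveDate DESC)
INCLUDE (E);

-- Latest row per key: match on the hash, then re-check the real
-- columns so that a hash collision cannot merge two different keys.
SELECT X.A, X.B, X.C, X.D, X.E
FROM Tbl AS X
WHERE NOT EXISTS (
    SELECT *
    FROM Tbl AS X2
    WHERE X2.KeyHash = X.KeyHash
      AND X2.A = X.A AND X2.B = X.B AND X2.C = X.C AND X2.D = X.D
      AND X2.EffectiveDate > X.EffectiveDate
);
```

Note that a nonclustered index still carries the clustered key columns in its rows, so while the table stays clustered on { A, B, C, D, EffectiveDate } this index will not be narrow. Clustering on the hash and EffectiveDate instead may be worth testing.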