SQL Server – Fact table Primary Key: Surrogate Key or multi-column Natural Key

database-design, insert, optimization, primary-key, sql-server

What's the best practice when choosing a fact table's Primary Key (PK for brevity): should we have an auto-incremented Surrogate Key (SK), or should we use the Natural Key's (NK's) columns?

The advantage I see in using an SK is INSERT performance, which is very relevant when millions of rows are inserted daily, as in my case. I'm expecting almost 20M rows per day and 200M rows per month, and the table must retain data for decades to come without needing to be split.

But query performance should be better with an NK as the PK, provided the order of the PK columns is well designed.

Of course, we can create non-clustered indexes for queries, but that consumes storage and makes INSERTs even slower.

Considering that INSERT and DELETE performance (yes, we delete a lot too, because data is reprocessed and new file versions must be "updated") is almost as relevant as query performance, which is best?

Another issue to consider: when rows are deleted, an SK counter doesn't decrease. At 200M rows per month, I'll burn through 2.4 billion key values per year, which the int datatype (max 2,147,483,647) can't hold, so I'll need to use bigint for the PK…
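For concreteness, the surrogate-key option described above might look like the following sketch. Table and column names are hypothetical; the point is the bigint IDENTITY column, which at ~2.4 billion new rows per year won't exhaust for millennia:

```sql
-- Hypothetical fact table using a bigint surrogate key.
-- int tops out at 2,147,483,647; bigint at 9,223,372,036,854,775,807.
CREATE TABLE dbo.FactSales
(
    SalesKey    bigint IDENTITY(1, 1) NOT NULL,
    DateKey     int    NOT NULL,          -- former NK component, now just an FK
    ProductKey  int    NOT NULL,          -- former NK component, now just an FK
    CustomerKey int    NOT NULL,          -- former NK component, now just an FK
    Amount      decimal(18, 2) NOT NULL,
    CONSTRAINT PK_FactSales PRIMARY KEY CLUSTERED (SalesKey)
);
```

With the NK alternative, the PRIMARY KEY constraint would instead span (DateKey, ProductKey, CustomerKey), making the clustered key, and every non-clustered index that carries it, correspondingly wider.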

Best Answer

What's the context? It sounds like you're talking about a data warehouse. If so, I strongly recommend creating a synthetic key. I can think of three good reasons, offhand:

  • It will be narrower than the business key. Adding a column to a fact table should not be done lightly, but because a narrow clustered key makes all your other indexes narrower, it may actually reduce the total size of your database.
  • Natural keys get dirty. Sooner or later someone will reuse a product code for something unrelated, or you'll need to start tracking sub-customers. If you have a mapping process between your natural keys and your synthetic keys, it's simple to account for these kinds of things.
  • You can use it to store synthetic values: -1 for "Not applicable", 0 for "Unknown", 1 for "Missing", 2 for "Illegible", etc. Depending on your natural keys, you might be able to do the same there, but there would always be a risk of collision.
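As an illustration of that last point, a common pattern is to seed each dimension with the synthetic members before loading any real data, so every fact row always has a key to join to. A sketch, with hypothetical table and column names and assuming ProductKey is an identity column:

```sql
-- Seed a hypothetical product dimension with reserved synthetic members.
-- IDENTITY_INSERT lets us assign the reserved key values explicitly.
SET IDENTITY_INSERT dbo.DimProduct ON;

INSERT INTO dbo.DimProduct (ProductKey, ProductCode, ProductName)
VALUES (-1, N'N/A', N'Not applicable'),
       ( 0, N'UNK', N'Unknown'),
       ( 1, N'MIS', N'Missing'),
       ( 2, N'ILL', N'Illegible');

SET IDENTITY_INSERT dbo.DimProduct OFF;
```

Real product codes then map to keys of 3 and up, and the ETL can fall back to the reserved members whenever a source value is absent or unparseable.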

48+ billion records is a lot. If you haven't already, you should consider aggregating data, perhaps after the current month. Synthetic keys could make this simpler too: perhaps values 1,000,000,001..1,999,999,999 are reserved for 2017/08 details, and in October the August data gets collapsed into a single record with ID 1,000,000,000. (That particular scheme probably wouldn't be a good idea; it's just an example of how you may need to decouple business keys from warehouse keys.)

One last note: don't forget that your choices for PK and for clustering key by no means need to be the same! Each should be evaluated on its own merits.
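For example (hypothetical names again), you can declare the primary key as NONCLUSTERED and cluster the table on something else entirely, such as a date column that matches how the data is loaded and queried:

```sql
CREATE TABLE dbo.FactShipments
(
    ShipmentKey bigint IDENTITY(1, 1) NOT NULL,
    DateKey     int    NOT NULL,
    Amount      decimal(18, 2) NOT NULL,
    -- The PK enforces uniqueness but is NOT the clustered index...
    CONSTRAINT PK_FactShipments PRIMARY KEY NONCLUSTERED (ShipmentKey)
);

-- ...while the physical order of the table follows DateKey,
-- which suits date-range scans and month-at-a-time aggregation.
CREATE CLUSTERED INDEX CIX_FactShipments_DateKey
    ON dbo.FactShipments (DateKey, ShipmentKey);
```

Here inserts still append in roughly key order (dates arrive mostly ascending), while range queries by date read contiguous pages instead of probing a non-clustered index.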