What's the best practice when choosing a fact table's Primary Key (PK for brevity): should we have an auto-incremented Surrogate Key (SK), or should we use the Natural Key's (NK) columns?
The advantage I see in using a SK is INSERT performance, which is very relevant when millions of rows are inserted daily, as in my case: I'm expecting up to 20M rows on peak days, roughly 200M rows per month, and the table must sustain data for decades to come without needing to be split.
But query performance should be better with the NK as PK, provided the column order of the key is well designed.
Of course we can create non-clustered indexes for queries, but that costs storage and makes INSERTs even slower.
Considering that INSERT and DELETE performance is very relevant (yes, we delete a lot too, because data is reprocessed and new file versions must be "updated"), almost as much as query performance, which is best?
Another issue to consider is that when rows are deleted, the SK counter doesn't decrease. At 200M rows per month, I'll have 2.4 billion rows per year, which the int datatype can't hold, so I'll need to use bigint for the PK…
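As a minimal sketch of that last point, assuming SQL Server and a hypothetical fact table (the table and column names below are illustrative, not from the question), a bigint identity surrogate key sidesteps the int ceiling:

```sql
-- Hypothetical fact table; names are illustrative only.
CREATE TABLE dbo.FactSales
(
    SalesKey   bigint IDENTITY(1,1) NOT NULL,  -- int maxes out at 2,147,483,647;
                                               -- at ~2.4 billion rows/year that is
                                               -- exceeded within the first year
    DateKey    int           NOT NULL,
    ProductKey int           NOT NULL,
    Amount     decimal(18,2) NOT NULL,
    CONSTRAINT PK_FactSales PRIMARY KEY CLUSTERED (SalesKey)
);
```

Note that deleted identity values are never reused, which is exactly why the counter only grows.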
Best Answer
What's the context? It sounds like you're talking about a data warehouse. If so, I strongly recommend creating a synthetic key. I can think of three good reasons, offhand:
48+ billion records is a lot. If you haven't already, you should consider aggregating data, perhaps after the current month. Synthetic keys could make this simpler too: perhaps values 1,000,000,001–1,999,999,999 are reserved for the 2017/08 details, and in October the August data gets collapsed into a single record with ID 1,000,000,000 (this particular scheme probably isn't a good idea; it's just an example of how you may need to disconnect business keys from warehouse keys).
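That reserved-range idea could be sketched as follows, assuming SQL Server and the same hypothetical dbo.FactSales table with a bigint identity key (all names and the -1 "all products" member are assumptions for illustration; as noted above, this is an example of the mechanism, not a recommendation):

```sql
-- Hypothetical rollup: collapse the August 2017 detail rows
-- (reserved keys 1,000,000,001..1,999,999,999) into one summary
-- row with the reserved key 1,000,000,000, then drop the details.
BEGIN TRANSACTION;

SET IDENTITY_INSERT dbo.FactSales ON;  -- allow an explicit key value

INSERT INTO dbo.FactSales (SalesKey, DateKey, ProductKey, Amount)
SELECT 1000000000,      -- reserved summary key for 2017/08
       20170801,        -- collapse the whole month onto one date
       -1,              -- placeholder "all products" member (assumption)
       SUM(Amount)
FROM dbo.FactSales
WHERE SalesKey BETWEEN 1000000001 AND 1999999999;

SET IDENTITY_INSERT dbo.FactSales OFF;

DELETE FROM dbo.FactSales
WHERE SalesKey BETWEEN 1000000001 AND 1999999999;

COMMIT;
```

The point is only that a synthetic key gives you the freedom to assign meanings like this; a natural key would tie the warehouse key to business data and make such a rollup awkward.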
One last note: don't forget that your choices for PK and for clustering key by no means need to be the same! Each should be evaluated on its own merits.
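To make that note concrete, here is a sketch (again assuming SQL Server and hypothetical names) where the surrogate key is enforced as a NONCLUSTERED primary key while the clustered index orders the data by query-friendly natural-key columns:

```sql
-- Hypothetical: PK and clustering key evaluated separately.
CREATE TABLE dbo.FactEvents
(
    EventKey   bigint IDENTITY(1,1) NOT NULL,
    DateKey    int           NOT NULL,
    ProductKey int           NOT NULL,
    Amount     decimal(18,2) NOT NULL,
    -- Uniqueness via the narrow surrogate key...
    CONSTRAINT PK_FactEvents PRIMARY KEY NONCLUSTERED (EventKey)
);

-- ...but physical order chosen for range queries on the natural key.
CREATE CLUSTERED INDEX CIX_FactEvents_Date_Product
    ON dbo.FactEvents (DateKey, ProductKey);
```

This way inserts still get a compact, ever-increasing key, while date-range scans benefit from the clustered order.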