Sql-server – SQL Server : primary keys advice to the whitepaper needed

database-designdatabase-theorydatabase-tuningsql server

I've tired to explain to every new junior developer in our r&d team why he should use Primary Keys and how to do that. So I decided to write small whitepaper, which every new developer should read. Here is the draft of it.

DISCLAIMER:

  1. I know and understand the difference between clustered index and primary key
  2. In the following question PRIMARY KEY means PRIMARY KEY CLUSTERED, and WITHOUT PRIMARY KEY means WITHOUT PK AND CLUSTERED INDEX
  3. Note that it is a whitepaper for junior and not SQL programmers
  4. All the things that they do – reviewed before entering to the main development branch
  5. I'm not going to explain to them – when there will be benefit of using clustered index and NONclustered primary key – else I will fall to premature optimizations hell

The question is – what else should I add to the document? May be – what I should change in it? And what to explain more in detail?

Here comes the draft:

  • Primary keys have to constrain any table within database.
  • Without PK a table considered as a heap and SQL Server has very limited uses of this type of data – the only thing I should say it is suitable – a buffer for fast bulk-loading data from the outside of SQL Server's engine.
  • Avoid using natural primary keys. Primarily because of theirs natural (gauss) distributions. For example in PhoneBook table with primary key based on Family and Name will be many Smiths and Wilsons and much fewer Zimmerbergs and this states, that pages containing Smiths and Wilsons will be splitted more often, than other pages, and queried also more often – which multiplies the performance impact. That leads to performance degradation, because of primary key's page fullness and most of searches will hit PK's sparsed pages.

Moreover, even using SSN or ID Number, which have (I hope) flat distribution, as PK does not solve the problem of PK's page splitting, because this numbers are not in any organized order.

  • Secondary – natural PKs are often composite. That creates composite foreign keys and wide indexes and as a result hurts performance. So, …
  • Avoid using composite PKs. Better using simple surrogate PK and composite UNIQUE index, than all-in-one composite PK because it leads to composite FK and wide indexes due to statement, that every secondary index on table with PK has to include whole PK within.
  • Avoid using surrogate primary keys other than integer or uniqueidentifier types
  • During database design phase it is very important to identify entities and corresponding tables which may have in perspective "zillions" of rows, or tables which keys have to be NOT ONLY table-wide unique, but DB or even world unique, or which have to be joined with other tables by this key over several hop-tables – this tables better to have UUID PK, others – ordinary integer, because SQL Server very well fine tuned to use integer PKs.
  • These two types can guarantee both vector distribution (IDENTITY or NEWSEQUENTIALID()) and monotonous sequence.
  • Main database design rule – Twenty minutes spent to well thought-out design will save days or even weeks during production database's maintenance

Best Answer

Your points are unrelated to database design: choice of natural or surrogate key is an implementation decisions after conceptual and logical models are complete

In addition to comments and other answers:

  • some natural keys work well such as currency or language codes (CHF, GBP, DE, EN etc)
  • avoiding composite keys forces you to always join intermediate tables (rather than simple) parent-grandchild
  • adding a surrogate key in unnecessary for link tables

Edit: example of "composite keys"

Assume: t1 has child t2 has child t3

  • If you had the key of t1 in t3 (composite key) you can join t1 and t3 directly.
    t1 key is also the left hand column of t3 key so you don't need an extra index

  • With a surrogate key/FK, you have to join via t2
    You need extra indexes on the FK columns in t2 and t3 which

This latter option with the "always use surrogate key" dogma

  • adds complexity
  • decreased or reverses disk space "savings"