The question is not 'when should the PK be NC', but instead you should ask 'what is the proper key for the clustered index'?
And the answer really depends on how do you query the data. The clustered index has an advantage over all other indexes: since it always includes all columns, is always covering. Therefore queries that can leverage the clustered index certainly do not need to use lookups to satisfy some of the projected columns and/or predicates.
Another piece of the puzzle is how can an index be used? There are three typical patterns:
- probes, when a single key value is seek-ed in the index
- range scans, when a range of key values is retrieved
- order by requirements, when an index can satisfy an order by w/o requiring a stop-and-go sort
So if you analyze your expected load (the queries) and discover that a large number of queries would use a particular index because they use a certain pattern of access that benefits from an index, it makes sense to propose that index as the clustered index.
Yet another factor is that the clustered index key is the lookup key used by all non-clustered indices and therefore a wide clustered index key creates a ripple effect and widens all the non-clustered indices and wide indices mean more pages, more I/O, more memory, less goodness.
A good clustered index is stable, it does not change during the lifetime of the entity, because a change in the clustered index key values means the row has to be deleted and inserted back.
And a good clustered index grows in order not randomly (each newly inserted key value is larger than the preceding value) as to avoid page splits and fragmentation (without messing around with FILLFACTOR
s).
So now that we know what a good clustered index key is, does the primary key (which is a data modelling logical property) match the requirements? If yes, then the PK should be clustered. If no, then the PK should be non-clustered.
To give an example, consider a sales facts table. Each entry has an ID that is the primary key. But the vast majority of queries ask for data between a date and another date, therefore the best clustered index key would be the sales date, not the ID. Another example of having a different clustered index from the primary key is a very low selectivity key, like a 'category', or a 'state', a key with only very few distinct values. Having a clustered index key with this low selectivity key as the leftmost key, e.g. (state, id)
, often makes sense because of ranges scans that look for all entries in a particular 'state'.
One last note about the possibility of a non-clustered primary key over a heap (i.e. there is no clustered index at all). This may be a valid scenario, the typical reason is when bulk insert performance is critical, since heaps have significantly better bulk insert throughput when compared to clustered indices.
I would suggest that the "right direction" would be to go a different direction entirely.
A sufficiently random and difficult-to-guess distribution of keys, assuming that is even an achievable objective, is likely to be otherwise sub-optimal for space and performance. But that's the second problem.
The first problem is that exposing the internals of your database in such a fashion is intrinsically unsafe.
Instead, create a separate column for this "PIN" -- of whatever desired length and data type, with a unique key constraint on the column so no more than one record can possibly have the same PIN associated with it.
You can then populate this value using whatever algorithm you want, and you can subsequently change the algorithm, change the min/max lengths of the value, etc., without any other impact on the rest of your application or database as a result of such changes.
Best Answer
Are you trying to create a key column that represents the a tree structure? If so, you could use concatenated, 32-bit hash values of each node in the tree to represent tre hierarchy.
For example, assume you wanted to represent the following:
You would store the following three rows with these keys:
...where + represents binary concatenation. You would then store the concatenated key in a VARBINARY(900), which leaves you space for a tree that is 900 / 4 = 225 nodes deep. Hopefully this would be enough? This can be indexed.
Note that you don't need the hash to be without collisions, as long as you can reduce the number of keys that must be searched.