Should Auto-Increment/IDENTITY Field Be Added to Cross-Reference Table?

clustered-indexprimary-keysql servert-sql

I'm adding the following cross-reference table to my SQL Server-hosted DB:

company_id bigint not null (FK)
org_path nvarchar (2048) not null

The company_id field refers to the id field in another table (in which it's the primary key).

Given that there can also be multiple records with the same company_id, any primary key would have to use both fields. However, I'm unable to create a key using both fields because org_path is too long for SQL Server.

As for org_path, this is the only table in which it exists. There's every likelihood that queries to this table will be asking for either all entries or all org_path entries by company_id. Or to put it another way, it looks doubtful that this table will ever be queried by org_path. Furthermore, it's unlikely that org_path will be updated, and more likely inserted and – probably rarely – deleted.

I expect that the total number of rows will be in the low thousands.

Also, the reason it's nvarchar (2048) is because the value has to mimic that in a third-party DB. A typical example will be something like

\Translation Providers\[customer name]\[order name]\

and can contain diacritics.

So my question is this: would it be more efficient to add an auto-increment id field and use that in conjunction with company_id as the primary key, or would it add unnecessary overhead – and does the fact that company_id is the primary key in another table have any effect here?

Best Answer

For a non-unique clustered index on comany_id alone, SQL Server will automatically add a 4 byte integer uniqueifier to all duplicate (i.e. second and subsequent for a key value) clustered index keys to make it unique. This is not exposed to the user though.

The advantage of adding your own unique identifier as a secondary key column is that you can then still seek by company_id but also seek to individual rows more efficiently (using company_id, identitycol rather than company_id with a residual predicate on org_path). The clustered index would then be unique on company_id, identitycol, so no hidden uniqueifiers would be added.

Also, if you end up with duplicates for (company_id,org_path), having the explicit identity column (a sort of "exposed uniqueifier") will make it easier to target just one of them for delete or update.