Sql-server – Natural Keys vs Surrogate Keys part 2

natural-keysql serversurrogate-key

A while back, I asked if surrogate keys provide better performance than natural keys in SQL Server. @sqlvogel provided an answer to that question yesterday that caused me to revisit it.

This question is an attempt to "upgrade" the prior question, and hopefully provide the opportunity for thoughtful answers that help the community.

Consider a system for storing details about computers. Each computer has an architecture, and an Operating System. In SQL Server, we could create these tables using natural keys like this:

CREATE TABLE dbo.Architecture
(
    ArchitectureName varchar(10) NOT NULL
    , ArchitectureVersion decimal(5,2) NOT NULL
    , ReleaseDate date NOT NULL
    , CONSTRAINT PK_Architecture
        PRIMARY KEY CLUSTERED
        (ArchitectureName, ArchitectureVersion)
);

CREATE TABLE dbo.Manufacturer
(
    ManufacturerName varchar(10) NOT NULL
        CONSTRAINT PK_Manufacturer
        PRIMARY KEY CLUSTERED
);

CREATE TABLE dbo.OS
(
    OSName varchar(30) NOT NULL
    , ManufacturerName varchar(10) NOT NULL
        CONSTRAINT FK_OS_Manufacturer
        FOREIGN KEY
        (ManufacturerName)
        REFERENCES dbo.Manufacturer(ManufacturerName)
    , ArchitectureName varchar(10) NOT NULL
    , ArchitectureVersion decimal(5,2) NOT NULL
    , CONSTRAINT FK_OS_Architecture
        FOREIGN KEY 
        (ArchitectureName, ArchitectureVersion)
        REFERENCES dbo.Architecture(ArchitectureName, ArchitectureVersion)
    , CONSTRAINT PK_OS
        PRIMARY KEY CLUSTERED
        (OSName)
);

CREATE TABLE dbo.Computers
(
    ComputerID varchar(10) NOT NULL
        CONSTRAINT PK_Computers
        PRIMARY KEY CLUSTERED
    , OSName varchar(30) NOT NULL
        CONSTRAINT FK_Computers_OSName
        FOREIGN KEY 
        REFERENCES dbo.OS(OSName)
    , ComputerManufacturerName varchar(10) NOT NULL
        CONSTRAINT FK_Computers_Manufacturer
        FOREIGN KEY 
        REFERENCES dbo.Manufacturer(ManufacturerName)
    , EffectiveDate datetime NOT NULL
        CONSTRAINT DF_Computers_EffectiveDate
        DEFAULT (GETDATE())
    , ExpiryDate datetime NULL
);

To query the dbo.Computers table, with 2 rows in dbo.Computers, showing various details, we could do this:

SELECT Computers.ComputerID
    , Computers.ComputerManufacturerName
    , OSManufacturer = OS.ManufacturerName
    , Computers.OSName
    , OS.ArchitectureName
    , OS.ArchitectureVersion
FROM dbo.Computers
    INNER JOIN dbo.OS ON Computers.OSName = OS.OSName
WHERE Computers.EffectiveDate <= GETDATE()
    AND (Computers.ExpiryDate >= GETDATE() OR Computers.ExpiryDate IS NULL)
ORDER BY Computers.ComputerID;

The query output is:

╔════════════╦══════════════════════════╦════════════════╦════════════╦══════════════════╦═════════════════════╗
║ ComputerID ║ ComputerManufacturerName ║ OSManufacturer ║ OSName     ║ ArchitectureName ║ ArchitectureVersion ║
╠════════════╬══════════════════════════╬════════════════╬════════════╬══════════════════╬═════════════════════╣
║ CM700-01   ║ HP                       ║ Microsoft      ║ Windows 10 ║ x64              ║ 1.00                ║
║ CM700-02   ║ HP                       ║ Microsoft      ║ Windows 10 ║ x64              ║ 1.00                ║
╚════════════╩══════════════════════════╩════════════════╩════════════╩══════════════════╩═════════════════════╝

The query plan for this is quite simple:

enter image description here

Or, if we choose to use surrogate keys, like this:

CREATE TABLE dbo.Architecture
(
    ArchitectureID int NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_Architecture
        PRIMARY KEY CLUSTERED
    , ArchitectureName varchar(10) NOT NULL
    , ArchitectureVersion decimal(5,2) NOT NULL
    , ReleaseDate date NOT NULL
    , CONSTRAINT UQ_Architecture_Name
        UNIQUE
        (ArchitectureName, ArchitectureVersion)
);

CREATE TABLE dbo.Manufacturer
(
    ManufacturerID int NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_Manufacturer
        PRIMARY KEY CLUSTERED
    , ManufacturerName varchar(10) NOT NULL
);

CREATE TABLE dbo.OS
(
    OS_ID int NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_OS
        PRIMARY KEY CLUSTERED
    , OSName varchar(30) NOT NULL
        CONSTRAINT UQ_OS_Name
        UNIQUE
    , ManufacturerID int NOT NULL
        CONSTRAINT FK_OS_Manufacturer
        FOREIGN KEY
        REFERENCES dbo.Manufacturer(ManufacturerID)
    , ArchitectureID int NOT NULL
        CONSTRAINT FK_OS_Architecture
        FOREIGN KEY 
        REFERENCES dbo.Architecture(ArchitectureID)
);

CREATE TABLE dbo.Computers
(
    ComputerID int NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_Computers
        PRIMARY KEY CLUSTERED
    , ComputerName varchar(10) NOT NULL
        CONSTRAINT UQ_Computers_Name
        UNIQUE
    , OS_ID int NOT NULL
        CONSTRAINT FK_Computers_OS
        FOREIGN KEY 
        REFERENCES dbo.OS(OS_ID)
    , ComputerManufacturerID int NOT NULL
        CONSTRAINT FK_Computers_Manufacturer
        FOREIGN KEY 
        REFERENCES dbo.Manufacturer(ManufacturerID)
    , EffectiveDate datetime NOT NULL
        CONSTRAINT DF_Computers_EffectiveDate
        DEFAULT (GETDATE())
    , ExpiryDate datetime NULL
);

In the design above, you may notice we have to include several new unique constraints to ensure our data model is consistent across both approaches.

Querying this surrogate-key approach with 2 rows in dbo.Computers looks like:

SELECT Computers.ComputerName
    , ComputerManufacturerName = cm.ManufacturerName
    , OSManufacturer = om.ManufacturerName
    , OS.OSName
    , Architecture.ArchitectureName
    , Architecture.ArchitectureVersion
FROM dbo.Computers
    INNER JOIN dbo.OS ON Computers.OS_ID = OS.OS_ID
    INNER JOIN dbo.Manufacturer cm ON Computers.ComputerManufacturerID = cm.ManufacturerID
    INNER JOIN dbo.Architecture ON OS.ArchitectureID = Architecture.ArchitectureID
    INNER JOIN dbo.Manufacturer om ON OS.ManufacturerID = om.ManufacturerID
WHERE Computers.EffectiveDate <= GETDATE()
    AND (Computers.ExpiryDate >= GETDATE() OR Computers.ExpiryDate IS NULL)
ORDER BY Computers.ComputerID;

The results:

╔══════════════╦══════════════════════════╦════════════════╦════════════╦══════════════════╦═════════════════════╗
║ ComputerName ║ ComputerManufacturerName ║ OSManufacturer ║ OSName     ║ ArchitectureName ║ ArchitectureVersion ║
╠══════════════╬══════════════════════════╬════════════════╬════════════╬══════════════════╬═════════════════════╣
║ CM700-01     ║ HP                       ║ Microsoft      ║ Windows 10 ║ x64              ║ 1.00                ║
║ CM700-02     ║ HP                       ║ Microsoft      ║ Windows 10 ║ x64              ║ 1.00                ║
╚══════════════╩══════════════════════════╩════════════════╩════════════╩══════════════════╩═════════════════════╝

enter image description here

The I/O statistics are even more telling. For the natural keys, we have:

Table 'OS'. Scan count 0, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.  
Table 'Computers'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.  

For the surrogate key setup, we get:

Table 'Manufacturer'. Scan count 0, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Architecture'. Scan count 0, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'OS'. Scan count 0, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Computers'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Quite clearly, in the above admittedly very simple setup, the surrogate key is lagging in both ease-of-use, and performance.

Having said that, what happens if we need to change the name of one of the manufacturers? Here's the T-SQL for the natural key version:

UPDATE dbo.Manufacturer 
SET ManufacturerName = 'Microsoft­™' 
WHERE ManufacturerName = 'Microsoft';

And the plan:

enter image description here

The T-SQL for the surrogate key version:

UPDATE dbo.Manufacturer 
SET ManufacturerName = 'Microsoft­™' 
WHERE ManufacturerID = 1;

And that plan:

enter image description here

The natural key version has an estimated subtree cost that is nearly three times greater than the surrogate key version.

Am I correct in saying that both natural keys and surrogate keys offer benefits; deciding which methodology to use should be carefully considered?

Are there common situations where the comparisons I made above don't work? What other considerations should be made when choosing natural or surrogate keys?

Best Answer

I think your question title, apart from the fact that it is not a question, is a bit misleading. You're not measuring performance of natural/surrogate keys, unlike the question you're referencing. Instead you're measuring performance of queries against two databases with notably different models, caused by the choice of their respective primary keys. What is the purpose of the Manufacturer entity in your model? I understand that it's a contrived example, but that is exactly its weakness as an example.

Clearly a query that needs 3 joins instead of 1 will require more I/O. Same (or worse) is with updating a parent key value.

Your first example essentially shows the effect of "denormalization" on query performance, which is trivial. Sure, it's not denormalization strictly speaking, but it effectively amounts to merging contents of what normally would be a lookup table (Manufacturer) into the table referencing it, avoiding a join to look up the manufacturer name. Ditto Architecture.

The second example affirms the opinion that primary keys should never be updated. Try instead

UPDATE dbo.Manufacturer 
SET ManufacturerID = 1 
WHERE ManufacturerID = 1;

and you'll probably get the same outcome as you do with the natural key.

Finally, to answer your question as stated: yes, so what?