Third Normal Form: Composite PRIMARY KEY vs System-Generated Surrogate (IDENTITY)

database-designnormalizationsql serversql server 2014

I’m working on a data modeling project and I’m trying to figure out the best data modeling approach for a history table, which has only four columns:

CREATE TABLE FooHistory
(
    SecurityID  INT (FK),     -- Part of the natural PK.
    FieldID     INT (FK),     -- Part of the natural PK.
    DateCreated DATETIME2(0), -- Part of the natural PK.
    Value       VARCHAR(50)
);

The natural composite KEY in this table would be (DateCreated, SecurityId, FieldID), and every 30 minutes the ETL process will add ~ 2K rows to this table.

Questions

  1. Pros and cons of declaring a composite PRIMARY KEY (PK) (DateCreated, SecurityId, FieldID) vs. adding a new IDENTITY column (that is, a system-generated surrogate) and use it as PK?

  2. I believe that, if I add an IDENTITY column and use it as PK, then this table won’t be in Third Normal Form (3NF), because there will be a functional dependency between non-PK columns, that is, (DateCreated, SecurityId, FieldID) and Value.

  3. Since this table keeps historical data I don’t expect to JOIN this table to other external tables, and applications will mainly interact with it using SELECT statements. Based on these assumptions is it worth it to keep the table in 3NF and declare a composite PK, or I should add the IDENTITY column to this table?

Best Answer

According to your description of the case under consideration, it seems that (a) appending a column to hold system-assigned surrogates* (surrogates for brevity) via an IDENTITY property to the History table is (b) superfluous.

A column retaining surrogates can be useful when a tentative natural PRIMARY KEY (PK) constraint has the potential of being referenced from FOREIGN KEY (FK) constraint definitions, but such tentative PK constraint is

  • wide, i.e., it encompasses a relatively large number of columns and/or
  • the values to be contained in the potential PK column(s) are physically heavy in terms of bytes (evidently, heavy values behave slower than light ones because of, e.g., larger memory and disk space usage concerning reading and writing processes, with the respective impact on the response speed of the logical data manipulation operations).

As a consequence, having FK constraints that point to a single column that (1) would somewhat work as a “substitute” for a natural KEY like the one described above and (2) would keep lighter values, is a possibility worth to assess. Regarding the History table at hand, it is neither logically wide nor is supported by a heavy physical scaffolding.

Besides, attaching a column for surrogates would necessarily make the table in question wider due to the incorporation of a tentatively needless element (and, perhaps?, of the speculative corresponding physical INDEX), so it might end up behaving slower. Hence, considering its effect on the overall performance of the database, it may not provide any pragmatic value regardless.

A situation in which this peculiar class of column can sometimes be of help is when sticked to tables that stand for entity types that are indpendent but, as you know, even in said kind of situation each particular table demands (i) individual and (ii) contextual analyses to determine whether sticking a column of that sort is convenient or not (and I guess that is why you asked the question).

At the logical level, a column that maintains surrogates, of course, neither captures more business domain meaning (which hinders readability and interpretation of result sets) nor protects real row uniqueness (a column that keeps meaningless values is an additional non-data artefact, so you still have to ensure row uniqueness of the genuine data elements). See this outstanding Stack Overflow answer by @PerformanceDBA for evidence about these aspects.

However, you “could” add it to the History table, “use” it as the PK and, at the same time, declare the combination of columns (DateCreated, SecurityId, FieldId) as an ALTERNATE KEY (AK) by means of (a) the configuration of a composite UNIQUE constraint and (b) the specification of every one of the columns in said combination as non-NULLable (arrangement that would actually guarantee row uniqueness with regard to the true data elements). In this way, the non-key column named Value would be functionally dependent on both

  • the column that retains surrogates, “utilized” as PK, and
  • the meaningful and natural AK made up of the three relevant columns;

thus, the table would still comply with 3NF§ anyway, although the usage of the column with surrogates would not be providing benefits, therefore it would be excessive.

Practical examples of why it would be excessive are:

  • The History table would mostly be queried by virtue of SELECT statements that include one or more of the columns that are part of the natural AK as conditions in the WHERE clause (while a column enclosing surrogates would hardly be included, because they are meaningles to end users).
  • INSERT operations INTO the table would be slowed down as a result of the integration of an extra column.

It is important to note that, at the physical level, establishing a good INDEXing strategy to serve the composite PK definition is paramount to optimize the functioning of the pertinent processes, hence some testing sessions should be carried out with respect to, e.g., the determination of the suitable order of the columns assisted by the applicable INDEX(es) configurations.


Endnotes

* System-assigned surrogates were evaluated in the 1979 paper entitled Extending the Database Relational Model to Capture More Meaning, which was written by Dr. Edgar Frank Codd, the creator of the relational paradigm. In said paper, it is stipulated that surrogate values must not be displayed to database users, even though in most databases I have seen that rule is not followed. The present answer is based on the assumption that, in the scenario under deliberation, surrogates are exposed just like ordinary business context values.

At the conceptual level, an entity type (or entity prototype) is independent when every one of its occurrences (i.e., rows, when retained in a SQL table) is uniquely identified by the values of one or more of its own attributes (i.e., columns) alone. A dependent one is that which requires the use of the value(s) of one (or more) attribute(s) that belong to other entity types so that each of its instances can be uniquely identified.

An ALTERNATE KEY is a column (or a combination of columns) that holds values that uniquely identify each row of the pertinent table but was not chosen as the PRIMARY KEY; each table can have zero, one or more ALTERNATE KEYS.

§ To analyze why this is so, see the genuine definitions of normal form and normalization in A Relational Model of Data for Large Shared Data Banks (1970) and of the successive normal forms (as suggested in Joel Brown’s subsequent answer) in Further Normalization of the Data Base Relational Model (1971), both works —naturally— by Dr. E. F. Codd. It is worth to mention that the original relational model (1970) does not include surrogates.