“Transitive” keys

database-designforeign keyprimary-key

I have a central table, A, which will have a key that I'll call A_ID. Now, I have a table B in a "one to many" relationship with A (many B's map to one A). Each link will, of course, have a link back to A_ID. But now I have a third table C in a "one to many" relationship with B. So, in B, I need at a minimum, a key B_ID so each C can link to the appropriate B. In this way, every A would have a unique A_ID and every B would have to have a unique B_ID.

But maybe a better way would be to let B_ID be unique within all the B's that link to a particular A. Then the B key would be a combination of A_ID and B_ID. And then I would have to make space in C for both A_ID and B_ID so each C could map back to its appropriate B.

I see advantages and disadvantages both ways. The second method requires a bit more storage and duplication. The first way would require more work from the user adding in new data. (There's going to be alot of new data added in.) Any thoughts? Is there a "preferred" way to handle this situation? Thanks.

In both models:

CREATE TABLE A
(
A_ID    INT NOT NULL IDENTITY PRIMARY KEY
)

Model A

CREATE TABLE B
(
A_ID    INT NOT NULL REFERENCES A (A_ID),
B_ID    INT NOT NULL IDENTITY PRIMARY KEY
)

CREATE TABLE C
(
-- No direct relationship to A
B_ID    INT NOT NULL REFERENCES B (B_ID),
C_ID    INT NOT NULL IDENTITY PRIMARY KEY
)

Model B

CREATE TABLE B
(
A_ID    INT NOT NULL REFERENCES A (A_ID),
B_ID    INT NOT NULL,
PRIMARY KEY (A_ID, B_ID)
)

CREATE TABLE C
(
A_ID    INT NOT NULL,
B_ID    INT NOT NULL,
PRIMARY KEY (A_ID, B_ID), -- Note sure about this line -- OP
FOREIGN KEY (A_ID, B_ID) REFERENCES B (A_ID, B_ID)
)

Best Answer

Both models have their distinct advantages and disadvantages. They also model different things.

Modeling Interpretation

In Model A, multiple C can exist for the same B, irrespective of what B's relationship to A is. As an example of Model A, consider a living room with objects in it. If A is the room and B is a TV - then a certain TV (B) may be in one and only one room (C) - but a room may contain multiple TVs. In this example, think of C as remote for the TV. It works with a specific TV and even if the TV is moved to another room, it continues to work with that TV. The remote does not care about which room it is in - it only cares about the TV it is designed for.

In Model B, C explicitly requires both a certain A and a certain B to exist for C to make sense. Also, the way you have modelled it here, there can be only one C for every B (if not, A_ID, B_ID would not be unique in C) Using our example, A is still the room and B is still a TV. But in this case, C would be the wall mount for the TV. The wall mount may or may not be there (there is nothing forcing the existence of C in the model). If the wall mount is there, it belongs to both the room and the TV. A change in any of them, will force a change in the wall mount.

Programability and convenience

In Model A, it is programatically easier to find a specific instance of C - because you don't need to worry about a composite key. This also means that any secondary indexes on the table are cheaper when using clustered indexes. However, if you enforce cascading deletes integrity on the database, these have have to be transitive which can cause some interesting locking scenarios.

In Model B, you will have to worry about composite keys. Programmers get this wrong all the time, and even when they don't, they often throw a DISTINCT in there for good measure. This can kill your performance or introduce subtle bugs, unless the programmers are careful. However, Model B creates an opportunity for a tuning technique known as co-location. Because A_ID is present in all tables, we can physically place the data that is joined in an optimal manner - by making sure everything with the same A_ID is stored in the same place (for example on the same server). This is useful in scenarios like sharding and MPP data warehouse systems. It means that even if the modelling interpretation of Model B is not exactly what we want, we may still adopt it purely for performance reasons (when we know what we are doing).