Technical Challenges of Composite Keys in SQL Server

database-design, sql-server

I'm sure this question has been asked before, but I'd like to put a twist on it to keep it from being opinion-based and to make it as informative as possible. (Please also forgive any mistakes in my notation; I'm still fairly new to this and want to get better.)

Say I'm creating a database (in SQL Server, an RDBMS I'm not familiar with) with a set of users and, for the sake of simplicity, I'm just keeping track of some APIs that these users use. Say I have APIs Alpha, Beta, Gamma, and Delta. Each of these has keyed access. I need to associate the keys with the users. My gut instinct would be to split the data as follows, in schema A:

,-------,         ,----------,
|*Users*|         |*APIkeys* |
|-------| 1       |----------|
| id^   |----\    | key^     |
| fname |    |    | service^ | 
| lname |  <Has>  | queries  |
| email |    |    | ...      |
| ...   |    \--<=| userid   |
| ...   |       n | ...      |
|_______|         |__________|   ( where ^ indicates primary key )

In this schema, the APIkeys table has a composite primary key, which makes sense from a high-level perspective. I'm keeping track of APIs and users. Each key is unique within its service, so it seems reasonable to make (key, service) a composite key.

The alternative, schema B, replaces the single APIkeys table with four separate key tables, one each for Alpha, Beta, Gamma, and Delta. This leads to a lot more tables, but avoids the composite key.

However, after thinking some more about the situation, I also have an alternate relational schema, Schema C:

,-------,        ,----------,        ,----------,
|*Users*|        |*UtoArel* |        |*APIkeys* |
|-------| 1    n |----------|      1 |----------|
| id^   |------<=| userID^  | n  /---| keyID^   |
| fname |        | keyID^   |=>-/    | service  |
| lname |        |__________|        | queries  |
| email |                            | ...      |
| ...   |                            | ...      |
|_______|                            |__________|    ( where ^ indicates primary key )

So now my questions are these (also, please keep in mind I'm using SQL Server, but I'd also like to know "in general" as well):

What are the advantages of schema A over B? A over C?
Are there any pitfalls of using composite keys? (It would be great if you can tie it in with this scenario too.)
Are there any other schemas that I missed that would be better?

And of course, any other advice would be welcome, I'm happy to learn.

Best Answer

The question conflates composite keys with schema design. The two are separate.

All keys serve the same purpose. Their value(s) uniquely identify a single instance of the type to which they refer. This is true whether the key is single or composite, natural or surrogate.

More than one key may be identified for a table. Collectively these are known as candidate keys. Some of them are likely to be natural keys, identified during the logical analysis phases of the project. Others will be surrogates (or composites of surrogates, or of surrogates and natural keys), typically created during physical modelling for performance or technical reasons.

Having identified the entities and their candidate keys, and normalised them, we can consider how the entities relate to each other. Here you can do yourself a big favour by avoiding wishy-washy names like 'has'. What is the real-world business term for the relationship between a single instance of User and a single instance of API? Complete these sentences:

Each User <relationship> [one | one or more] API.  
Each API <relationship> [one | one or more] User.

If the relationship is "administers" then it's likely many-to-many and schema C is appropriate. If the relationship is "logs in using", I would hope each user has their own key, so schema A may be appropriate. It may be that the system we're writing needs to track both administration and credentials, in which case parts of both A and C are needed. My point is that the business rules and the cardinality determine the table layout. Key composition plays no part.
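As a rough sketch of how cardinality drives the layout, here are schemas A and C side by side. It runs on SQLite through Python's sqlite3 module purely so that it is easy to try; SQL Server's DDL differs in detail. The column api_key, the table names apikeys_a, apikeys_c and utoarel, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE);

-- Schema A: "logs in using". Each key belongs to exactly one user.
CREATE TABLE apikeys_a (
    api_key TEXT    NOT NULL,
    service TEXT    NOT NULL,
    userid  INTEGER NOT NULL REFERENCES users (id),
    PRIMARY KEY (api_key, service)
);

-- Schema C: "administers". A key may be linked to many users.
CREATE TABLE apikeys_c (
    keyid   INTEGER PRIMARY KEY,
    service TEXT NOT NULL
);
CREATE TABLE utoarel (
    userid INTEGER NOT NULL REFERENCES users (id),
    keyid  INTEGER NOT NULL REFERENCES apikeys_c (keyid),
    PRIMARY KEY (userid, keyid)
);

INSERT INTO users (id, email) VALUES (1, 'a@example.com'), (2, 'b@example.com');
""")

# Schema A: a second user cannot be attached to the same (api_key, service).
conn.execute("INSERT INTO apikeys_a VALUES ('k-123', 'Alpha', 1)")
try:
    conn.execute("INSERT INTO apikeys_a VALUES ('k-123', 'Alpha', 2)")
    shared_in_a = True
except sqlite3.IntegrityError:
    shared_in_a = False

# Schema C: the junction table lets two users share key 10.
conn.execute("INSERT INTO apikeys_c VALUES (10, 'Alpha')")
conn.executemany("INSERT INTO utoarel VALUES (?, ?)", [(1, 10), (2, 10)])
sharers_in_c = conn.execute(
    "SELECT COUNT(*) FROM utoarel WHERE keyid = 10"
).fetchone()[0]

print(shared_in_a, sharers_in_c)
```

Neither layout is better in the abstract: A enforces one owner per key, while C permits sharing, and which of those is right is a business rule.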

Now we come to the physical design of the tables, and this is where composite keys matter at last. In a relational database, tables are linked by foreign keys: the column(s) that make up a key of one table are included in the other table^*. It follows that if we choose a candidate key that is a composite of many long columns, there will be an overhead compared with using another candidate which is, say, a single integer. More disk space will be used, more working memory will be required for queries, and join conditions will be more complicated and so more prone to bugs. Conversely, if we insist on using only integer surrogate keys, there must be continual conversion between these internal representations, incoming search terms, and outgoing results, all of which adds load to the system. The balance between these factors is as much art as it is science.
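To make that trade-off concrete, the sketch below builds the same parent/child pair twice: once with the composite natural key as the foreign key, and once with an integer surrogate. It uses SQLite via Python's sqlite3 module as a stand-in for SQL Server, and the usage tables, the api_key column, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Variant 1: the composite natural key is the primary key.
CREATE TABLE apikeys_nat (
    api_key TEXT NOT NULL,
    service TEXT NOT NULL,
    PRIMARY KEY (api_key, service)
);
-- The child has to repeat both key columns in every row.
CREATE TABLE usage_nat (
    api_key   TEXT NOT NULL,
    service   TEXT NOT NULL,
    called_at TEXT NOT NULL,
    FOREIGN KEY (api_key, service) REFERENCES apikeys_nat (api_key, service)
);

-- Variant 2: an integer surrogate is the primary key;
-- the natural key stays unique as a second candidate key.
CREATE TABLE apikeys_sur (
    keyid   INTEGER PRIMARY KEY,
    api_key TEXT NOT NULL,
    service TEXT NOT NULL,
    UNIQUE (api_key, service)
);
-- The child carries one small integer instead.
CREATE TABLE usage_sur (
    keyid     INTEGER NOT NULL REFERENCES apikeys_sur (keyid),
    called_at TEXT NOT NULL
);

INSERT INTO apikeys_nat VALUES ('k-123', 'Alpha'), ('k-456', 'Beta');
INSERT INTO usage_nat VALUES
    ('k-123', 'Alpha', '2024-01-01'),
    ('k-123', 'Alpha', '2024-01-02'),
    ('k-456', 'Beta',  '2024-01-03');

INSERT INTO apikeys_sur VALUES (1, 'k-123', 'Alpha'), (2, 'k-456', 'Beta');
INSERT INTO usage_sur VALUES
    (1, '2024-01-01'), (1, '2024-01-02'), (2, '2024-01-03');
""")

# Natural key: the service name is already in the child, so no join is needed.
calls_nat = conn.execute(
    "SELECT COUNT(*) FROM usage_nat WHERE service = 'Alpha'"
).fetchone()[0]

# Surrogate key: the search term must be translated through the parent table.
calls_sur = conn.execute("""
    SELECT COUNT(*)
    FROM usage_sur AS u
    JOIN apikeys_sur AS k ON k.keyid = u.keyid
    WHERE k.service = 'Alpha'
""").fetchone()[0]

print(calls_nat, calls_sur)
```

Both queries return the same count. The first pays for it with wider child rows, the second with an extra join.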

In SQL Server specifically, each table either has a clustered index or it does not. Several considerations militate strongly in favour of giving nearly every table a clustered index. That decided, there are good reasons to make the clustering key unique and as small as possible, i.e. an integer surrogate key. This naturally leads to the same key being used as the foreign key in all relationships. Unfortunately, this becomes a reflex after a while, and every table ends up with an integer primary key as its clustered index, usually an IDENTITY column for good measure.

There is no requirement that every foreign key referencing a table must reference the same candidate key. However, every candidate key thus referenced must have a unique constraint. The indexes behind these constraints carry a write overhead, so it is usual for the same parent candidate key to be used as the foreign key in every child table.
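A small sketch of that rule, again using SQLite through Python's sqlite3 module rather than SQL Server. The logins, mailings and bad_child tables are hypothetical. One behavioural difference to keep in mind: SQLite reports a reference to a non-unique column only when rows are written to the child, whereas SQL Server rejects the constraint when it is created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,      -- candidate key 1 (surrogate)
    email TEXT NOT NULL UNIQUE,     -- candidate key 2 (natural)
    lname TEXT NOT NULL             -- not a key: no unique constraint
);
INSERT INTO users VALUES (1, 'a@example.com', 'Smith');

-- Two children referencing two different candidate keys of the same parent.
CREATE TABLE logins   (userid INTEGER NOT NULL REFERENCES users (id));
CREATE TABLE mailings (email  TEXT    NOT NULL REFERENCES users (email));

-- A child referencing a column that is not unique in the parent.
CREATE TABLE bad_child (lname TEXT NOT NULL REFERENCES users (lname));
""")

conn.execute("INSERT INTO logins VALUES (1)")
conn.execute("INSERT INTO mailings VALUES ('a@example.com')")

try:
    conn.execute("INSERT INTO bad_child VALUES ('Smith')")
    bad_accepted = True
except sqlite3.Error:
    # SQLite refuses: the referenced column has no PRIMARY KEY or UNIQUE constraint.
    bad_accepted = False

good_rows = conn.execute(
    "SELECT (SELECT COUNT(*) FROM logins) + (SELECT COUNT(*) FROM mailings)"
).fetchone()[0]
print(bad_accepted, good_rows)
```

Every extra candidate key referenced this way needs its own unique index on the parent, which is the write overhead mentioned above.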

In conclusion, I would say there is no advantage of schema A over B or C. Each either implements the requirements of the task at hand and is correct or it does not and it is wrong. If it implements them efficiently for the DBMS in use then so much the better.

^* Colloquially these are called "parent" and "child"; the proper terms are "referenced" and "referencing".

Related Solutions

Sql-server – Indexing – Uniqueidentifier Foreign Key or Intermediary mapping table

Ok, I am making a lot of assumptions (INT instead of VARCHAR(50) being one of them) with this answer, so feel free to correct me if needed. The problem with option B is that it introduces a new join to relate Users to Alerts without any real added benefit. If joining on the UserID, it is best to index the UserID, so you can utilize seeks for your joins.

For Option A, UserID will be the clustering key (index key for the clustered index) on the Users table. UserID will be a nonclustered index key on Alerts table. This will cost 16 bytes per Alert.

For Option B, UserID will be the clustering key on the Users table. UserID will probably be the clustering key in UserMap too, to make joining more efficient. UserKey (assuming this is an INT) would then be a nonclustered index key on the Alerts table. This will cost 4 bytes per Alert and 20 bytes per UserMap row.

Looking at the big picture, one relationship in Option A costs 16 bytes of storage and involves 1 join operation, whereas one relationship in Option B costs 24 bytes of storage and involves 2 join operations.

Furthermore, there are 340,282,366,920,938,000,000,000,000,000,000,000,000 possible uniqueidentifier values (about 3.4 × 10^38) and only 4,294,967,296 possible INT values. Implementing a uniqueidentifier-to-INT map for this type of relationship could cause unexpected results when you start reusing INTs.
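The arithmetic behind these figures can be checked in a few lines of Python. The only inputs are the storage sizes of the two types (16 bytes for a uniqueidentifier, 4 bytes for an INT); the variable names are made up for this sketch.

```python
GUID_BYTES = 16  # SQL Server uniqueidentifier
INT_BYTES = 4    # SQL Server int

# Option A: each Alert stores the 16-byte UserID directly.
option_a_bytes = GUID_BYTES

# Option B: each Alert stores a 4-byte UserKey, and each UserMap row
# stores the 16-byte UserID plus the 4-byte UserKey.
usermap_row_bytes = GUID_BYTES + INT_BYTES
option_b_bytes = INT_BYTES + usermap_row_bytes

# Number of distinct bit patterns each type can hold.
guid_values = 2 ** (8 * GUID_BYTES)
int_values = 2 ** (8 * INT_BYTES)

print(option_a_bytes, option_b_bytes)
print(f"{guid_values:,}")
print(f"{int_values:,}")
```

The 24-byte figure counts one Alert together with one UserMap row, so it is a per-relationship cost rather than a per-Alert cost.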

The only reason for creating this type of map table is if you plan on creating a many-to-many relationship between Users and Alerts.

Taking all of this into consideration, I would recommend Option A.

I hope this helps,

Matt

Database Design – Implementing a Voting System Like Reddit

What you have seems pretty solid. How about

create table vote (
   user_id int not null,
   content_id int not null,
   time_recorded datetime not null,
   vote int not null,
   primary key (user_id, content_id),
   foreign key (user_id) references `user`(user_id)
      on delete cascade,
   foreign key (content_id) references content(content_id)
      on delete cascade,
   check (vote = 1 or vote = -1)
);

A composite primary key on those two columns makes perfect sense, and they absolutely should be foreign keys as well. I added cascading deletes from the parent tables: if a user is deleted, their votes are removed from the content (which, IIRC, is what happens on Stack Exchange), and if the content is removed, there is no real point in keeping the votes on it. I also implemented the vote column as an integer which must be either 1 or -1, rather than a boolean. That way you can get an overall tally of up and down votes with a simple sum(vote), rather than two select count(*) queries.
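Here is a trimmed-down version of that table exercised end to end, as a sketch on SQLite via Python's sqlite3 module (the original DDL above is written for a different engine). The time_recorded column is left out for brevity, and the parent tables and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for ON DELETE CASCADE in SQLite
conn.executescript('''
CREATE TABLE "user"  (user_id    INTEGER PRIMARY KEY);
CREATE TABLE content (content_id INTEGER PRIMARY KEY);
CREATE TABLE vote (
    user_id    INTEGER NOT NULL REFERENCES "user" (user_id) ON DELETE CASCADE,
    content_id INTEGER NOT NULL REFERENCES content (content_id) ON DELETE CASCADE,
    vote       INTEGER NOT NULL CHECK (vote = 1 OR vote = -1),
    PRIMARY KEY (user_id, content_id)
);
INSERT INTO "user" VALUES (1), (2), (3);
INSERT INTO content VALUES (100);
-- Two upvotes and one downvote on the same piece of content.
INSERT INTO vote VALUES (1, 100, 1), (2, 100, 1), (3, 100, -1);
''')

def score(content_id):
    """Net tally for one piece of content: upvotes minus downvotes."""
    return conn.execute(
        "SELECT SUM(vote) FROM vote WHERE content_id = ?", (content_id,)
    ).fetchone()[0]

score_before = score(100)

# The composite primary key allows one vote per user per piece of content.
try:
    conn.execute("INSERT INTO vote VALUES (1, 100, -1)")
    double_vote_rejected = False
except sqlite3.IntegrityError:
    double_vote_rejected = True

# Deleting user 3 cascades to their downvote.
conn.execute('DELETE FROM "user" WHERE user_id = 3')
score_after = score(100)

print(score_before, double_vote_rejected, score_after)
```

The tally moves from 1 to 2 once the downvoting user is deleted, which is the cascade doing its job.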
