Database Design – Should Referencing Table Have Primary Key, Auto Increment, or Index?

database-design

I'm trying to implement a list. Each list item will be associated with one or more tags. And each tag will be associated with one or more list items.

I'm still pretty new to this field. Fortunately, I managed to find a solution in this question, whose answer suggests the following database design:

Articles table
- id
- name
- etc

Relational Tag Arts table
- Fkey tags id
- Fkey article id

Tags table
- id
- name
- etc

I already created my three tables. Now what I don't understand is that, should the referencing table (the one that contains two foreign keys) have primary key, auto increment, or index?

[Solved] And although I'm not 100% sure, I guess the referencing table's content is purposely "repetitive". For example, if I add a new item which is associated with 5 tags, then the referencing table should be added with 5 more records, which all have the same item id but have different tag id's. Am I correct?

Best Answer

You may want to consider taking one step back a think about the design process. Once you are familiar with that you can often skip it and directly implement it. I would suggest to start out with the tables (in theory called relations). You seem to have found:

* Articles
* Tags

Next step is to think about what uniquely identifies a row (tuple) in Articles and Tags. That may be a single column (attribute) or a combination of columns. What would you use to identify an article if you talked to someone about it? All such attributes or combination of attributes is known as candidate keys. In this process, you may choose to add an artificial (surrogate) key, I assume id is such an attribute. The reasons for adding surrogate keys can, for example, be that the natural keys are too complex, not stable enough, etc. You should still declare them as UNIQUE though.

Speaking of id, I strongly suggest that you name it article_id or similar, and stick with that name through the model. Names such as Id, Name, etc are too vague to be meaningful and usually collides with reserved words in the DBMS that you use. Assume we have concluded this:

CREATE TABLE Articles
( article_id ... not null  
, article_name ... not null
, publishing_date date not null
, ...
,    primary key (article_id) -- surrogate key
,    unique ( article_name, publishing_date ) -- natural key
);

CREATE TABLE Tags 
( tag_id ... not null primary key
, tag_name ... not null unique
);

How are these tables related? You seem to have concluded that there is an N-N relationship between an Article and a Tag and this is usually implemented with a third table (sometimes named junction or link table). Let's call this table:

CREATE TABLE Article_Tags
( article_id ... not null references Articles (article_id)
, tag_id ... not null references Tags (tag_id)

);

Now for your original question, what uniquely identifies this relationship? I would suggest here a composite key like:

CREATE TABLE Article_Tags
( article_id ... not null references Articles (article_id)
, tag_id ... not null references Tags (tag_id)
,     primary key (article_id, tag_id)
);

and not add an additional surrogate key (I'm not even sure what the purpose of such key would be, but far too often I see it added in a design by blindfold fashion). With the suggested composite key, you are guaranteed that the same tag won't be used more than once for each article.

If you sometimes want to retrieve all articles for a given tag you can add an index like:

CREATE UNIQUE INDEX ... ON Article_Tags (tag_id, article_id);

EDIT: some info on indexes versus keys

An index is a physical concept and is in fact not even mentioned in the SQL standard, meaning each vendor can implement them as they please, whereas a key of some kind is a logical concept. Note that in some DBMS there's a fuzzy line between these two concepts, for example, it is possible to have a foreign key reference a unique index in SQL-server, and MySQL call indexes keys.

Most DBMS:s (and everyone I know of) implements logical keys (UNIQUE, PRIMARY KEY) with physical indexes, so if you add a key, an index is created under the hood. The following keys are redundant:

PRIMARY KEY (a,b) <=> UNIQUE (b,a)

since if (a,b) is unique, so is (b,a). It is a fat chance that your DBMS won't let you create one if the other is present. On the other hand:

INDEX (a,b) is not redundant with INDEX(b,a)

since a query like:

SELECT ... FROM T WHERE a = ?

can use the first index to efficiently find rows that match, but the second index can't. You can think of it as if you have an index organized as first-name, last-name and you try to find someone with a last-name Gates or Thorvalds. You basically have to scan through a lot of first-names before you find all persons with those last-names.

So why not create one single column index for every column used in searches? Many DBMS can do INDEX ANDING (but some can't) and it is anyhow less efficient than using one multi-column index. Once again you can compare it with a process of first looking at an index for first-name and find out that there's a Bill on page 14, 50, and 126. Then look at another index for last-name and find out that there's a Gates at page 21, 50, 123 and 212, and finally intersect those two sets to conclude that Bill Gates can be found at page 50. Compare with an index for first-name, last-name where you find Bill and then can find Gates very easily.

You can use a multicolumn index a,b,c,d,e,f for finding a, ab, abc, etc, but it won't be very efficient if you look for WHERE d=? and f=?.

The above is much simplified but may give you an idea or two on how to start thinking on the design process.