File name as primary key

database-design primary-key

I have a few tables I'm going to create for a website: literature, video, and image. Is there a good reason why I shouldn't use the file name as the primary key?

For instance…

"someimage.png" as the PK for table image.

The only downside I can think of is if files named someimage.png existed in multiple directories and were not all the same image. That seems like poor design, and I can't think of a reason why I'd do that anyway.

Thoughts?

Best Answer

A PRIMARY KEY, or any index for that matter, is accessed faster when the key is short. It is easier to use a 4-byte integer as the unique identifier for an image than the full path and file name (of various and ridiculous lengths).

Think of the clustered index, where the PRIMARY KEY resides. In MySQL (InnoDB), the row data is stored in the clustered index, and every secondary index entry also carries a copy of the primary key value. Wouldn't a smaller datatype (4 bytes) just make more sense? With a long key, every index on the table grows along with it.
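To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes each secondary index entry stores one copy of the primary key, and it ignores page overhead, fill factor, and encoding details. The row count, index count, and average file-name length are made-up figures for illustration, not measurements.

```python
def pk_copy_bytes(rows, secondary_indexes, pk_bytes):
    """Bytes spent on primary-key copies stored inside secondary indexes."""
    return rows * secondary_indexes * pk_bytes

ROWS = 1_000_000          # assumed table size
SECONDARY_INDEXES = 2     # assumed number of secondary indexes
INT_KEY_BYTES = 4         # INT primary key
PATH_KEY_BYTES = 60       # assumed average length of a file-name key

int_cost = pk_copy_bytes(ROWS, SECONDARY_INDEXES, INT_KEY_BYTES)
path_cost = pk_copy_bytes(ROWS, SECONDARY_INDEXES, PATH_KEY_BYTES)

print(f"INT key:       {int_cost / 1e6:.0f} MB of key copies")
print(f"file-name key: {path_cost / 1e6:.0f} MB of key copies")
```

Under these assumptions the overhead scales linearly with the key width, so the wider key costs proportionally more in every secondary index.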

To create a unique number for each image, you need a table that resembles something like this:

CREATE TABLE images
(
    image_id INT NOT NULL AUTO_INCREMENT,
    image_name VARCHAR(255) NOT NULL,
    image_folder VARCHAR(255) NOT NULL,
    PRIMARY KEY (image_id),
    UNIQUE KEY name_folder (image_name,image_folder),
    KEY folder (image_folder)
);

This design allows multiple files with the same image name, each located in a different folder, but denies having two files with the same image name in the same folder.
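The two properties can be checked with a small script. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL (so AUTO_INCREMENT becomes SQLite's AUTOINCREMENT and VARCHAR becomes TEXT), and the second folder name is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE images (
        image_id     INTEGER PRIMARY KEY AUTOINCREMENT,
        image_name   TEXT NOT NULL,
        image_folder TEXT NOT NULL,
        UNIQUE (image_name, image_folder)
    )
""")

insert = "INSERT INTO images (image_name, image_folder) VALUES (?, ?)"

# Same file name in two different folders: both rows are accepted.
conn.execute(insert, ("someimage.jpg", "/some/linux/folder"))
conn.execute(insert, ("someimage.jpg", "/another/folder"))  # hypothetical folder

# Same file name in the same folder: rejected by the UNIQUE constraint.
try:
    conn.execute(insert, ("someimage.jpg", "/some/linux/folder"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM images").fetchone()[0]
print(duplicate_rejected, row_count)
```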

From here, just INSERT and retrieve the created image_id as follows:

INSERT IGNORE INTO images (image_name,image_folder)
VALUES ('someimage.jpg','/some/linux/folder');
SELECT image_id INTO @imageid FROM images
WHERE image_name='someimage.jpg'
AND image_folder='/some/linux/folder';
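From application code, the same insert-then-look-up step might be wrapped in a helper. The sketch below again uses sqlite3 as a stand-in, where INSERT OR IGNORE plays the role of MySQL's INSERT IGNORE. The function name image_id_for is invented for this example.

```python
import sqlite3

def image_id_for(conn, name, folder):
    """Insert the (name, folder) pair if it is new, then return its image_id."""
    # A row that would violate the UNIQUE constraint is silently skipped.
    conn.execute(
        "INSERT OR IGNORE INTO images (image_name, image_folder) VALUES (?, ?)",
        (name, folder),
    )
    row = conn.execute(
        "SELECT image_id FROM images WHERE image_name = ? AND image_folder = ?",
        (name, folder),
    ).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE images (
        image_id     INTEGER PRIMARY KEY AUTOINCREMENT,
        image_name   TEXT NOT NULL,
        image_folder TEXT NOT NULL,
        UNIQUE (image_name, image_folder)
    )
""")

first = image_id_for(conn, "someimage.jpg", "/some/linux/folder")
second = image_id_for(conn, "someimage.jpg", "/some/linux/folder")
total = conn.execute("SELECT COUNT(*) FROM images").fetchone()[0]
print(first == second, total)
```

Calling the helper twice with the same pair returns the same id and leaves a single row in the table.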

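Doing it this way may leave gaps in the image_id sequence, because an INSERT IGNORE that hits a duplicate can still consume an AUTO_INCREMENT value. To avoid that, you may want to check for the row first: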

SELECT COUNT(1) INTO @image_is_there FROM images
WHERE image_name='someimage.jpg'
AND image_folder='/some/linux/folder';

If @image_is_there is 0, then run:

INSERT INTO images (image_name,image_folder)
VALUES ('someimage.jpg','/some/linux/folder');
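The check-then-insert flow could look like this in application code. It is a sketch with sqlite3 as a stand-in for MySQL, and insert_if_missing is a made-up helper name.

```python
import sqlite3

def insert_if_missing(conn, name, folder):
    """Return True if a new row was inserted, False if it already existed."""
    (count,) = conn.execute(
        "SELECT COUNT(1) FROM images WHERE image_name = ? AND image_folder = ?",
        (name, folder),
    ).fetchone()
    if count == 0:
        conn.execute(
            "INSERT INTO images (image_name, image_folder) VALUES (?, ?)",
            (name, folder),
        )
        return True
    return False

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE images (
        image_id     INTEGER PRIMARY KEY AUTOINCREMENT,
        image_name   TEXT NOT NULL,
        image_folder TEXT NOT NULL,
        UNIQUE (image_name, image_folder)
    )
""")

inserted_first = insert_if_missing(conn, "someimage.jpg", "/some/linux/folder")
inserted_again = insert_if_missing(conn, "someimage.jpg", "/some/linux/folder")
inserted_other = insert_if_missing(conn, "otherimage.jpg", "/some/linux/folder")
ids = [r[0] for r in conn.execute("SELECT image_id FROM images ORDER BY image_id")]
print(inserted_first, inserted_again, inserted_other, ids)
```

Another session could insert the same row between the SELECT and the INSERT, so the UNIQUE key is still worth keeping as the final safeguard.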

Related Solutions

Database Design – Poor Man’s Referential Integrity Schema Design Pattern

Your design looks a bit like the "supertype/subtype" pattern. Search for that and for "table inheritance". It needs quite a lot of work to be able to enforce integrity constraints though.

You are missing a generic Fruit table (that's the "supertype") and a FruitType table to store the allowed fruit types:

FruitType 
    fruit_type PK

Fruit
    fruit_type PK, FK -> FruitType (fruit_type)
    fruit_id   PK

Then the 3 (or 4 or more) tables would be (the "subtype" tables):

Apple
    fruit_type PK
    fruit_id PK
    (fruit_type, fruit_id) FK -> Fruit (fruit_type, fruit_id)
    CHECK (fruit_type = 'Apple')

Banana 
    fruit_type PK
    fruit_id PK
    (fruit_type, fruit_id) FK -> Fruit (fruit_type, fruit_id)
    CHECK (fruit_type = 'Banana')

Orange
    fruit_type PK
    fruit_id PK
    (fruit_type, fruit_id) FK -> Fruit (fruit_type, fruit_id)
    CHECK (fruit_type = 'Orange')

And any other table can reference the Fruit table:

FruitPack 
    fruitpack_id PK 
    destination

FruitPackFruits 
    fruitpack_id FK -> FruitPack (fruitpack_id)
    fruit_id     
    fruit_type
    (fruit_type, fruit_id) FK -> Fruit (fruit_type, fruit_id)

It doesn't look very nice and one column in every "fruit" table seems redundant as it has one and only one allowed value. And every time you need to add a new fruit (say Cherry), you have to add a row in the table FruitType and a new table (Cherry), similar to the other ones. So, it works better if your design is more or less stable. If you find that you may need to add a new "fruit" every few days or if you have a thousand (or more!) different fruits, it's not the best way.

On the other hand, it enforces integrity and you can't insert cherries into the Bananas or oranges into the Apples.
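A minimal sketch of that enforcement, using Python's sqlite3 module and only the Apple subtype for brevity. The column types are assumptions, since the outline above does not specify them, and SQLite only enforces foreign keys once the pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE FruitType (fruit_type TEXT PRIMARY KEY);
    CREATE TABLE Fruit (
        fruit_type TEXT NOT NULL REFERENCES FruitType (fruit_type),
        fruit_id   INTEGER NOT NULL,
        PRIMARY KEY (fruit_type, fruit_id)
    );
    CREATE TABLE Apple (
        fruit_type TEXT NOT NULL CHECK (fruit_type = 'Apple'),
        fruit_id   INTEGER NOT NULL,
        PRIMARY KEY (fruit_type, fruit_id),
        FOREIGN KEY (fruit_type, fruit_id) REFERENCES Fruit (fruit_type, fruit_id)
    );
    INSERT INTO FruitType VALUES ('Apple'), ('Banana');
    INSERT INTO Fruit VALUES ('Apple', 1), ('Banana', 2);
""")

def try_insert(sql, params):
    """Run an INSERT and report whether the constraints accepted it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

apple_sql = "INSERT INTO Apple (fruit_type, fruit_id) VALUES (?, ?)"
apple_ok = try_insert(apple_sql, ("Apple", 1))          # matches a Fruit row
banana_in_apple = try_insert(apple_sql, ("Banana", 2))  # blocked by the CHECK
unknown_apple = try_insert(apple_sql, ("Apple", 99))    # blocked by the FK
print(apple_ok, banana_in_apple, unknown_apple)
```

The CHECK keeps other fruit types out of the Apple table, and the composite foreign key keeps out apples that were never registered in Fruit.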

Database Design – Is Matriculation Number a Good Primary Key?

I currently work in this field. This is a case where a surrogate key should be used.

Our software probably deals with a much wider range of scenarios than yours, but that extra flexibility may mean a lot for you in the future.

Student numbers (overall) may change quite frequently depending on how the numbers are assigned by the distribution authority.

When a new student is enrolled, a number or unique identifier may have to be internally assigned to the student before the authority has actually generated a number for the student (asynchronously). In this case, it may be necessary to use more than one field in the database... which crushes the primary key idea immediately.

Student numbers that change are just bad news for lots of reasons. Remember that the primary key in this table will be propagated to many parts of the system, and with many associated rows.

Performance is rarely a concern for single-student data operations. Depending on how the indexes are set up, using the student number as an alternate key may have a slight performance hit. However, at least in my experience, it's pretty rare that this is a significant user experience issue. You're more likely to run into problems with performance of batch processes and reporting that looks at groups of students (by school, by grade, by class, report cards, etc.).

The length and format of a student number varies by jurisdiction. Some may be numeric, some may be alphanumeric. Do you need to handle more than one format? If a numeric student number is used as the primary key now, you set yourself up for a world of hurt later if you need to change to alphanumeric (and not only from a database perspective!).

Also, if that happens (or if you use alphanumeric to start), performance will be diminished, as string comparison is generally less efficient than integer comparison.

That you can guarantee the student number is unique is good. Make sure to enforce it using appropriate uniqueness and validation constructs.

Use the student number (alone, or in combination with other business keys) whenever it's necessary to integrate between your application and another.
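As a rough illustration of the surrogate-plus-alternate-key layout, here is a sketch using Python's sqlite3 module. The table names, column names, and sample values are all invented for the example. The student number stays unique but can be assigned late and changed without touching any referencing row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE student (
        student_id     INTEGER PRIMARY KEY,  -- surrogate key
        student_number TEXT UNIQUE,          -- business key, NULL until assigned
        full_name      TEXT NOT NULL
    );
    CREATE TABLE enrollment (
        enrollment_id INTEGER PRIMARY KEY,
        student_id    INTEGER NOT NULL REFERENCES student (student_id),
        course        TEXT NOT NULL
    );
""")

# Enroll before the authority has issued a number.
cur = conn.execute("INSERT INTO student (full_name) VALUES (?)", ("Ada Example",))
student_id = cur.lastrowid
conn.execute(
    "INSERT INTO enrollment (student_id, course) VALUES (?, ?)",
    (student_id, "Math 101"),
)

# The number arrives later, then changes to an alphanumeric format.
update = "UPDATE student SET student_number = ? WHERE student_id = ?"
conn.execute(update, ("12345", student_id))
conn.execute(update, ("AB-12345", student_id))

courses = [
    row[0]
    for row in conn.execute(
        """
        SELECT e.course
        FROM enrollment AS e
        JOIN student AS s ON s.student_id = e.student_id
        WHERE s.student_number = ?
        """,
        ("AB-12345",),
    )
]
print(courses)
```

Only one row in student is updated when the number changes; the enrollment rows keep pointing at the same student_id.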

Your software may fit a narrow box of a situation where it looks "okay" to use the student number as a primary key. If that's the case, I cannot offer many negatives about that approach. However, not using a surrogate now sets you up for pain down the road if you need to venture even an inch out of that box.
