Should a composite primary key out of 3-4 attributes or a unique artificial ID should be used for the situation below

database-designerd

For a small beginner project to start with databases and system development, I'm trying to develop a system that represents the classic case of an administration system for students and exams. This is based on the following diagram:

The two simple entities student and exam don't cause me any problems, but I'm struggling to find the best way to solve the many-to-many relationship between the two. I realize that I have to create an additional table for this, but I'm not sure if I should choose a composite key of 3-4 attributes or an artificial key with a unique ID. I have worked out both approaches a little, here are the results:

Composite Key:

With this solution, the idea is to identify a test result via the MatriculationNumber, ExamID and Date. In my opinion, the date must also be added to the key, since it is possible that exams that have not been passed must be taken again and managed in the system. Therefore, MatriculationNumber and ExamID would not be sufficient to uniquely determine the two completed exams. However, I also came up with an extreme case: What happens if an exam and the catch-up date (for whatever reason) take place on the same day and both have the same grade, say 5.0? Then it is not possible to distinguish these two things. The whole thing would become even more difficult if even a third attempt had to be taken into account.

Artificial ID:

In this case, a new, artificial ID is to be created in order to uniquely determine each result. However, there are no implicit advantages to the first solution, which, for example, allows only one entry for each exam and student and tag. However, I would like to be as flexible as possible with my solution, and not let the design of the database dictate my professionalism.

In general I would like to know from you experts which of the two solutions fits better for my application. It should be possible to obsolete several attempts of the same exam of a single student, even if they could fall on the same day. I don't want to let the database dictate or limit the subject. It should also be possible to build a REST-API based on what I find rather difficult in the case of the composite key, at least after what I have read on the net. In your opinion, what is the best practice for this problem?

EDIT
To describe the overall situation even better, a few additional details: For the implementation I would like to use a postgreSQL database. Because it is only a simple introduction example, where no billions of records are stored or time-critical operations are needed, the focus is not on performance at all. However, I find it more important to be able to expand the application and business logic later, for example by storing documents or other information. I don't want a feature to destroy the complete structure of my database or API. My main concern here is that it can happen to me that I have to make such rudimentary changes if I want additional information in the solution with the merged key. On the other hand, I still lack a clear statement or justification why I should introduce an artificial ID and what are the advantages of it

Best Answer

I agree with your analysis. Using (MatriculationNumber, ExamID, Date) as the key of TestResult will, indeed, allow each person to take the test once per day at most.

What to do about it depends on the problem domain - what is and is not allowed in real life. If a person is not permitted to resit an exam immediately your initial model is fine. You've indicated you would like to accommodate this, so further data must be captured to support it.

The easiest change would be to stop using DATE and start using DATETIME¹. This will allow separate results down to the resolution of the clock. Inside the computer that's about microseconds. In real life, say, one minute. Can a student complete an exam twice inside sixty seconds? No - good; dust hands; job done.

Let's say we must retain the DATE column, because < reasons >. Then some other attribute must be added. I'll propose a new column called attempt_number. And because I must know this value to uniquely identify a row it becomes part of the table key. It now has a four-column composite key on (MatriculationNumber, ExamID, Date, attempt_number).

What values should my proposed column allow? One option is it resets each day. On Monday I have attempts 1, 2 and 3. On Tuesday I have attempts 1 through 7 (busy day). On Wednesday I start again at 1 and pass on attempt 4. Another option is that it resets per exam - I start with attempt 1 on Monday and I pass at attempt 14 on Wednesday. Yet another is it resets per student: this exam took attempts 43-45 and 47-57 because I have already had 42 goes at other exams before I started this one and I attempted a different exam on Monday afternoon.

There is no right or wrong option here. Each is valid. Each makes some queries easier and some harder (e.g. "how many times in total did Joe sit exam Alpha" versus "how many times did Joe sit exam Alpha on Tuesday"). The choice would be governed by the real-world practices you are modeling.

There is one further option for values of attempt_number - that they are global across all students, exams and dates. The first time anyone sits any exam it is given attempt number 1, the next is 2 and so on. attempt_number is, in fact, unique through the whole table. This is a good point to introduce some nomenclature.

A key is any column or set of columns whose values uniquely distinguish one row from all other rows in that table. Because multiple sets of columns can be unique each set is called a candidate key. In formal relational theory a relation must have at least one candidate key and can have more. Relational DBMSs (the physical implementation of relational theory) however typically allow tables (the implementation of relations) to exist without any key defined. Keys that consist of columns that occur naturally in day-to-day user conversation are termed natural keys. Every student will know her matriculation number. Every examiner will know the exam's ID. If you say "this is my 4th attempt today," that will be undestood. These could be used in natural keys. Sometimes we introduce new columns for performance or design considerations. If these new columns summarise and stand in for other candidate keys they are called surrogate keys. They are system-generated an internal. They are not well-know to people in the domain. No-one would understand if you exclaimed "I passed on ResultID 8720982!"

Thinking further about an attempt_number that is global across the whole TestResult table i.e. every row has a different value. Since knowing one value uniquely identifies a row it is now a candidate key. Since it stands in for the natural key (MatriculationNumber, ExamID, Date, < something else >) it is also a surrogate key. This is what I think you mean with your ResultID column. If we implement ResultID in this way, my proposed attempt_number is redundant and can be forgotten.

Surrogates have the advantage that they are shorter than the natural keys they represent. This saves space at the "child" end of a foreign key, for example. Sometimes it is just not possible to uniquely identify an entity by natural values. A person is the canonical example. They do have great flexibility as we as programers define them and can define their range and semantics to suit whatever purpose we choose. The drawback of surrogates is that UI input, report output and data transfers invariably need the "natural" values, so the foreign keys have to be traversed in these circumstances.

For the example given, I would implement ResultID as a surrogate ID. It's values will be global across the table. This will give flexibility to extend the schema easily in the future. I would also change DATE to DATETIME so a believable natural key also exists.

¹ Don't ever name a column the same as a datatype. This column will be exam_start_time or somesuch.

Related Solutions

Mysql – Exclude dataset from a query

This is standard SQL

SELECT
   *
FROM
   EXAM E
WHERE
   NOT EXISTS (SELECT *
       FROM
          CANCELED_EXAM CE
       WHERE
          E.StudentID = CE.StudentID AND
          E.SubjectID = CE.SubjectID AND
          E.SchoolYear = CE.SchoolYear AND
          E.Period = CE.Period )

Ms-access – Database Design Regarding a Relationship for Patient Visit

The first thing that catches my eye is what appear to be separate tables for each exam. This could get rather difficult to maintain in the future: if the structure of exams changes you need to update n tables. Also, any aggregate queries that cover all exams for 1 patient will need to join n tables.

I'd suggest a structure like this:

patient_exams
-------------
  patient_exam_id (PK)
  visit_id (FK)
  patient_id (FK)
  study_id (Fk)
  exam_seq_num 
  (other fields)

Use exam_seq_num to track which exam number the record is for a patient. You can use a composite key containing patient_id, visit_id, study_id, exam_seq_num to ensure that you don't get exams for a patient with duplicate sequence numbers. You'll still need a bit of code to create the correct sequence number, maybe an on-insert trigger.

UPDATE:

Ok, so now it's clear that the exame tables are actually for different types of exams. You could have something like

base_exam
---------
  id (PK)
  patient_id (FK)
  exam_date
  (other stuff)

Lumbar_exam_details
-------------------
  lumbar_exam_id (PK)
  base_exam_id (FK to base_exam.id)
  (other specific fields)

blood_exam_details
------------------
  blood_exam_id (pk)
  base_exam_id (FK to base_exam.id)
  (other specific fields)

All of your exam detail tables reference the base_exam table, which stores common fields for all exams (such as the date of the exam, the patient who was examined, etc...).

If you really want to have a "display name" for exam types, I would do that in a view that overlays the specific exam table. For example, the query for lumbar_exam_view might look like:

SELECT *, "Lumbar Exam" AS DISPLAY_NAME
FROM LUMBAR_EXAM_DETAILS

Use this view in any queries/reports on lumbar_exam_details you will have access to display_name anywhere that you want the user-friendly string.

If you need the display name to be stored as actual data, you can add an exam_type_id field to base_exam and then have it point to an exam_type table:

exam_type
---------
  id
  display_name

Data:

exam_types
ID  | display_name
------------------
1   | Lumbar Exam
2   | Blood test

Now your base exam records have an ID that points them to the correct user-friendly string. Note that this does not ensure that the exam detail record is of the correct type (i.e. it is possible, for example, to have a base_exam record that is referenced by lumbar_exam_details, but the base_exame record erroneously references the display name "Blood Test") - it only works on the display name.

Best Answer

Related Solutions

Mysql – Exclude dataset from a query

Ms-access – Database Design Regarding a Relationship for Patient Visit

Related Question