If there is column with values from some finite set S
, for example S = { alive, dead, unknown }
for a life_state
column in a persons
table, when is it beneficial to add a reference table?
Options:
-
Storing duplicated static data in a
life_state
column-
simple strings: code on the application layer level would need to exist to get possible data values since
SELECT DISTINCT
could be missing some values. -
ENUM
: if you want to get all possible data values forlife_state
then you would need to runSHOW COLUMN..
fromINFORMATION_SCHEMA
to get all possible data values and parse that data on the application layer level or have duplicate code on the application layer level. If a new life_state value needed to be added we would need toALTER
the table and lock the whole table up in the process while it is being altered. No additional data could be stored for eachlife_state
value should it be necessary.
-
-
Normalized Create a reference table
life_states
and a linking table back topersons
table. The possible values could be derived fromSELECT name from life_states;
which would return 3 rows. Additional joins are now required to get thelife_state
for anypersons
instance. This is not only slower but requires longer queries/more code on the application layer level. Any additional values could easily be added if necessary, though it isn't anticipated. Furthermore, if some other data needed to be added to the life_state type it could be easily added.
Other considerations:
- INT/ENUM indexing might be quicker/take up less space strings.
Related discussions:
Best Answer
When is it beneficial to move duplicate data to a reference table?
When the referenced data exists independently from the referencing data. In this example, is it useful in the problem domain to know what the possible life states are even when there are no rows in
persons
.When the values can only be drawn from a known, finite set. This is often referred to as "reference data". A foreign key constraint can enforce this.
When referential integrity must be maintained between two (or more) normalized entities. This will be for transactional data where the values will not be known in advance. For example where an application allows the addition of new users at run-time, and has
last_updated_userid
on each table. Again a foreign key will enforce this.When additional domain values are discovered that depend sole on the life_state. This is the usual key dependency normalization.
When meta data is required e.g.
is_active
for "soft deletes" instead of removing the value from the system entirely.Although it is customary to create a surrogate key for the reference table it is not required. For the example given, the reference table could have a single seven-character column with three rows. The human-readable values would then appear in the referencing tables. This would obviate the need to join to the referenced table.