When is it beneficial to move duplicate data to a reference table

Tags: denormalization, enum, normalization, optimization

If there is a column whose values come from some finite set S, for example S = { alive, dead, unknown } for a life_state column in a persons table, when is it beneficial to add a reference table?

Options:

  • Storing duplicated static data in a life_state column

    • simple strings: code would have to exist at the application layer to list the possible values, since SELECT DISTINCT could miss values that happen not to be in use.

    • ENUM: to get all possible values for life_state you would need to run SHOW COLUMNS or query INFORMATION_SCHEMA, then either parse the column definition at the application layer or keep a duplicate list of values in application code. If a new life_state value needed to be added, the table would have to be ALTERed, locking the whole table while it is being altered. No additional data could be stored for each life_state value should that become necessary.

  • Normalized: create a reference table life_states and link the persons table back to it. The possible values can then be derived from SELECT name FROM life_states;, which would return 3 rows. An additional join is now required to get the life_state for any persons row. This is not only slower but requires longer queries/more code at the application layer. Any additional values could easily be added if necessary, though none are anticipated. Furthermore, if some other data needed to be stored for each life_state, it could easily be added. (A sketch of both options follows this list.)
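
The trade-off can be made concrete with a short MySQL-flavoured sketch; the persons_normalized name and the column definitions are illustrative, not taken from the question:

    -- Option 1: ENUM column (the allowed values live in the column definition)
    CREATE TABLE persons (
        person_id  INT PRIMARY KEY,
        name       VARCHAR(100) NOT NULL,
        life_state ENUM('alive', 'dead', 'unknown') NOT NULL
    );

    -- Listing the allowed values means parsing the column type, e.g.:
    SELECT COLUMN_TYPE
    FROM   INFORMATION_SCHEMA.COLUMNS
    WHERE  TABLE_NAME = 'persons' AND COLUMN_NAME = 'life_state';
    -- returns enum('alive','dead','unknown'), which still has to be
    -- parsed at the application layer

    -- Option 2: normalized reference table plus foreign key
    CREATE TABLE life_states (
        life_state_id INT PRIMARY KEY,
        name          VARCHAR(7) NOT NULL UNIQUE
    );

    CREATE TABLE persons_normalized (
        person_id     INT PRIMARY KEY,
        name          VARCHAR(100) NOT NULL,
        life_state_id INT NOT NULL,
        FOREIGN KEY (life_state_id) REFERENCES life_states (life_state_id)
    );

    -- The possible values are now ordinary data:
    SELECT name FROM life_states;          -- 3 rows

    -- ...at the cost of a join when reading:
    SELECT p.name, ls.name AS life_state
    FROM   persons_normalized p
    JOIN   life_states ls ON ls.life_state_id = p.life_state_id;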

Other considerations:

  • INT/ENUM indexes might be quicker and take up less space than string indexes.

Best Answer

When is it beneficial to move duplicate data to a reference table?

  1. When the referenced data exists independently of the referencing data. In this example, it is useful in the problem domain to know what the possible life states are even when there are no rows in persons.

  2. When the values can only be drawn from a known, finite set. This is often referred to as "reference data". A foreign key constraint can enforce this.

  3. When referential integrity must be maintained between two (or more) normalized entities. This applies to transactional data where the values are not known in advance, for example where an application allows new users to be added at run-time and has a last_updated_userid column on each table. Again, a foreign key will enforce this.

  4. When additional domain values are discovered that depend solely on the life_state. This is the usual key-dependency normalization.

  5. When metadata is required, e.g. is_active for "soft deletes" instead of removing the value from the system entirely (see the sketch below).
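
Points 4 and 5 can be illustrated by extending the reference table from the earlier sketch; the display_name and is_active columns are assumptions made for the sake of the example:

    -- Reference table carrying an additional domain value (display_name)
    -- and metadata for soft deletes (is_active)
    CREATE TABLE life_states (
        life_state_id INT PRIMARY KEY,
        name          VARCHAR(7)  NOT NULL UNIQUE,
        display_name  VARCHAR(50) NOT NULL,
        is_active     BOOLEAN     NOT NULL DEFAULT TRUE
    );

    INSERT INTO life_states VALUES
        (1, 'alive',   'Alive',   TRUE),
        (2, 'dead',    'Dead',    TRUE),
        (3, 'unknown', 'Unknown', TRUE);

    -- "Soft delete" a value without breaking rows that already reference it:
    UPDATE life_states SET is_active = FALSE WHERE name = 'unknown';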

Although it is customary to create a surrogate key for the reference table, it is not required. For the example given, the reference table could have a single seven-character column with three rows. The human-readable values would then appear in the referencing tables. This would obviate the need to join to the referenced table.
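
A minimal sketch of that natural-key variant, with illustrative names: the seven-character value is both the primary key of the reference table and the column stored in persons, so reads need no join while the foreign key still restricts the value to the three allowed states.

    -- Natural key: the human-readable value itself is the key
    CREATE TABLE life_states (
        name CHAR(7) PRIMARY KEY       -- 'alive', 'dead', 'unknown'
    );

    CREATE TABLE persons (
        person_id  INT PRIMARY KEY,
        name       VARCHAR(100) NOT NULL,
        life_state CHAR(7) NOT NULL,
        FOREIGN KEY (life_state) REFERENCES life_states (name)
    );

    -- No join is needed to read the human-readable value:
    SELECT name, life_state FROM persons;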