Well-known name for this “poor man’s ref. integrity” schema design pattern

database-design design-pattern polymorphic-associations

Is there a name for the following database schema design/pattern? My eventual goal is to find more literature about the subject. Today's cursory net search was too full of generic words to be able to pin down the term (if any exists) for this kind of thing:

Fruit (id, farm)

Apple (fruit_id, color)
    [fruit_id => Fruit.id]

Banana (fruit_id, length)
    [fruit_id => Fruit.id]

Orange (fruit_id, is_seedless)
    [fruit_id => Fruit.id]

FruitPack (id, destination)

FruitPackFruits (fruitpack_id, fruit_id, fruit_type)
    [fruit_id => Fruit.id, fruit_type => VARCHAR]

Where fruit_type would be a varchar column filled with values like "Apple, Banana, Orange, Cherry". It's some kind of "poor man's referential integrity". Obviously, one of the failures of this kind of design is being able to insert values that don't resolve to a useful join (ie: there are no cherries to speak of here).
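To make that failure concrete, here is a minimal sketch using Python's built-in sqlite3 module. It assumes SQLite as the engine and invents the sample rows ('North Farm', 'Oslo'); only the Apple subtype is created, to keep it short. The 'Cherry' row is accepted without complaint, and it only shows up as a problem once you try to join it to something:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
    CREATE TABLE Fruit (id INTEGER PRIMARY KEY, farm TEXT);
    CREATE TABLE Apple (fruit_id INTEGER PRIMARY KEY REFERENCES Fruit (id), color TEXT);
    CREATE TABLE FruitPack (id INTEGER PRIMARY KEY, destination TEXT);
    CREATE TABLE FruitPackFruits (
        fruitpack_id INTEGER REFERENCES FruitPack (id),
        fruit_id     INTEGER REFERENCES Fruit (id),
        fruit_type   TEXT  -- free text: nothing ties it to an actual table
    );
    -- Sample data (made up for illustration).
    INSERT INTO Fruit VALUES (1, 'North Farm');
    INSERT INTO Apple VALUES (1, 'red');
    INSERT INTO FruitPack VALUES (10, 'Oslo');
    -- Accepted, even though fruit 1 is an apple and no Cherry table exists.
    INSERT INTO FruitPackFruits VALUES (10, 1, 'Cherry');
""")

# Rows whose fruit_type does not lead to a matching Apple row.
orphans = conn.execute("""
    SELECT fpf.fruit_id, fpf.fruit_type
    FROM FruitPackFruits fpf
    LEFT JOIN Apple a
           ON fpf.fruit_type = 'Apple' AND a.fruit_id = fpf.fruit_id
    WHERE a.fruit_id IS NULL
""").fetchall()
print(orphans)
```

The foreign key on fruit_id is honoured, so the database is "consistent" as far as it can tell; the mismatch lives entirely in the unconstrained fruit_type column.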

Here's another example of such a pattern: A single "log (id, table_name, record_id, timestamp)" table that acts as a sort of tracker for modification-times in various other tables. Strictly speaking, it's got no ref integrity, but, the (table_name, record_id) part is supposed to refer to some record in another table, requiring a join to actually get the full data.

I'm going to take for granted that the schema is a sufficient caricature of some sort of collection of groups of items for the people here.

The question is: What's this kind of "poor man's referential integrity" called?

I'm not trying to learn about referential integrity. I want to identify this poor design's name and look further into the "let's design a database schema" aspects (ex: pros, cons, opinions, teachings, etc) that have to do with this commonly seen disaster of a schema.

Best Answer

Your design looks a bit like the "supertype/subtype" pattern. Search for that and for "table inheritance". It needs quite a lot of work to be able to enforce integrity constraints though.

You are missing a generic Fruit table (that's the "supertype") and a FruitType table to store the allowed fruit types:

FruitType 
    fruit_type PK

Fruit
    fruit_type PK, FK -> FruitType (fruit_type)
    fruit_id   PK

Then the 3 (or 4 or more) tables would be (the "subtype" tables):

Apple
    fruit_type PK
    fruit_id PK
    (fruit_type, fruit_id) FK -> Fruit (fruit_type, fruit_id)
    CHECK (fruit_type = 'Apple')

Banana 
    fruit_type PK
    fruit_id PK
    (fruit_type, fruit_id) FK -> Fruit (fruit_type, fruit_id)
    CHECK (fruit_type = 'Banana')

Orange
    fruit_type PK
    fruit_id PK
    (fruit_type, fruit_id) FK -> Fruit (fruit_type, fruit_id)
    CHECK (fruit_type = 'Orange')

And any other table can reference the Fruit table:

FruitPack 
    fruitpack_id PK 
    destination

FruitPackFruits 
    fruitpack_id FK -> FruitPack (fruitpack_id)
    fruit_id     
    fruit_type
    (fruit_type, fruit_id) FK -> Fruit (fruit_type, fruit_id)

It doesn't look very nice and one column in every "fruit" table seems redundant as it has one and only one allowed value. And every time you need to add a new fruit (say Cherry), you have to add a row in the table FruitType and a new table (Cherry), similar to the other ones. So, it works better if your design is more or less stable. If you find that you may need to add a new "fruit" every few days or if you have a thousand (or more!) different fruits, it's not the best way.

On the other hand, it enforces integrity and you can't insert cherries into the Bananas or oranges into the Apples.
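As a rough sketch of how this could look in practice, the following uses Python's built-in sqlite3 module. It assumes SQLite (other engines differ in syntax details), creates only the Apple subtype, and uses invented sample rows plus a hypothetical helper named rejected. The composite foreign key together with the CHECK is what stops a subtype row from pointing at the wrong kind of fruit:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE FruitType (fruit_type TEXT PRIMARY KEY);
    CREATE TABLE Fruit (
        fruit_type TEXT    NOT NULL REFERENCES FruitType (fruit_type),
        fruit_id   INTEGER NOT NULL,
        PRIMARY KEY (fruit_type, fruit_id)
    );
    CREATE TABLE Apple (
        fruit_type TEXT    NOT NULL CHECK (fruit_type = 'Apple'),
        fruit_id   INTEGER NOT NULL,
        color      TEXT,
        PRIMARY KEY (fruit_type, fruit_id),
        FOREIGN KEY (fruit_type, fruit_id) REFERENCES Fruit (fruit_type, fruit_id)
    );
    CREATE TABLE FruitPack (fruitpack_id INTEGER PRIMARY KEY, destination TEXT);
    CREATE TABLE FruitPackFruits (
        fruitpack_id INTEGER NOT NULL REFERENCES FruitPack (fruitpack_id),
        fruit_type   TEXT    NOT NULL,
        fruit_id     INTEGER NOT NULL,
        FOREIGN KEY (fruit_type, fruit_id) REFERENCES Fruit (fruit_type, fruit_id)
    );
    -- Sample data (made up for illustration).
    INSERT INTO FruitType VALUES ('Apple'), ('Banana');
    INSERT INTO Fruit VALUES ('Apple', 1), ('Banana', 2);
    INSERT INTO Apple VALUES ('Apple', 1, 'red');
    INSERT INTO FruitPack VALUES (10, 'Oslo');
    INSERT INTO FruitPackFruits VALUES (10, 'Apple', 1);
""")

def rejected(sql):
    """Hypothetical helper: True if the database refuses the statement."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        conn.rollback()
        return True
    conn.rollback()  # undo the probe so later checks see the original data
    return False

# A banana cannot be filed under Apple (CHECK constraint).
banana_as_apple = rejected("INSERT INTO Apple VALUES ('Banana', 2, 'yellow')")
# An apple row needs a matching ('Apple', id) row in Fruit (composite FK).
apple_without_fruit = rejected("INSERT INTO Apple VALUES ('Apple', 2, 'green')")
# A pack cannot reference a fruit type/id pair that does not exist.
cherry_in_pack = rejected("INSERT INTO FruitPackFruits VALUES (10, 'Cherry', 3)")
# A pack can reference any real fruit, whatever its type.
banana_in_pack = rejected("INSERT INTO FruitPackFruits VALUES (10, 'Banana', 2)")
print(banana_as_apple, apple_without_fruit, cherry_in_pack, banana_in_pack)
```

The first three statements are refused and the last is accepted, which is the behaviour the varchar-only version of FruitPackFruits cannot give you.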

Related Solutions

The name of this schema pattern

What you are describing is a data warehouse. The live, normalized, read-write system is OLTP (online transaction processing) and the denormalized read-only snapshot is a data warehouse. The structure of the data warehouse could be a Star Schema, especially if it's highly denormalized. Data warehouses often have summarization in addition to denormalization. There can be many copies of the same data summarized over various dimensions and/or timeframes.

The disadvantages of this technique are that the snapshot is generally not 100% up to date and you have to be very careful about how your snapshot is taken or you could actually introduce discrepancies other than timeliness into your data warehouse. Another possible issue is that you may have difficulty doing some kinds of reporting out of a data warehouse because of the choices you made when rolling up details into your summary tables. Also, if your data warehouse has multiple summaries over different timeframes, for example, you have to be careful to keep these consistent with one another.

The timeliness issue in particular is one you have to be careful about. I've seen users make a change to their online system and then get angry that it didn't show up right away in a report that is run against the data warehouse. Users tend not to know or care about the vagaries of reporting systems.
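A toy illustration of that timeliness gap, using Python's built-in sqlite3 module. Everything here is an assumption made for the sketch: the orders and sales_by_region tables, the sample amounts, and the refresh_snapshot/report helpers. A real warehouse load is far more involved, but the stale-until-refreshed behaviour is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "Live" transactional table (hypothetical).
    CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount INTEGER);
    INSERT INTO orders (region, amount)
    VALUES ('north', 100), ('north', 50), ('south', 70);
    -- Denormalized summary that reports read from (hypothetical).
    CREATE TABLE sales_by_region (region TEXT PRIMARY KEY, total INTEGER);
""")

def refresh_snapshot():
    """Rebuild the summary table from the live data in one transaction."""
    with conn:
        conn.execute("DELETE FROM sales_by_region")
        conn.execute(
            "INSERT INTO sales_by_region "
            "SELECT region, SUM(amount) FROM orders GROUP BY region"
        )

def report():
    """What a reporting user sees: only the summary table."""
    return dict(conn.execute("SELECT region, total FROM sales_by_region"))

refresh_snapshot()
before = report()

# A user changes the online system...
conn.execute("INSERT INTO orders (region, amount) VALUES ('south', 30)")
conn.commit()
stale = report()  # ...but the report has not moved yet.

refresh_snapshot()
fresh = report()  # Only the next snapshot picks the change up.
print(before, stale, fresh)
```

Wrapping the delete and the reload in one transaction is one simple way to avoid readers seeing a half-built summary; it does nothing about the delay itself.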

Name for this database schema of key values

It's called Entity-Attribute-Value (also sometimes 'name-value pairs') and it's a classic case of "a round peg in a square hole" when people use the EAV pattern in a relational database.

Here's a list of why you shouldn't use EAV:

You can't use data types. It doesn't matter if the value is a date, a number or money (decimal). It's always going to be coerced to varchar. This can be anything from a minor performance problem to a massive gut-ache (ever had to chase down a one-cent variation in a monthly roll-up report?).
You can't (easily) enforce constraints. It requires a ridiculous amount of code to enforce "Everyone needs to have a height between 0 and 3 metres" or "Age must be not null and >= 0", as opposed to the 1-2 lines that each of those constraints would be in a properly-modelled system.
Related to above, you can't easily guarantee that you get the information you need for each client (age might be missing from one, then the next might be missing their height etc.). You can do it, but it's a hell of a lot more difficult than SELECT height, weight, age FROM Client where height is null or weight is null.
Related again, duplicate data is a lot harder to detect (what happens if they give you two ages for one client? De-EAVing the data, as below, will give you two rows of results if you have one attribute doubled. If one client has two separate entries for two attributes, you'll get four rows from the query below).
You can't even guarantee that the attribute names are consistent. "Age_yr" might become "AGE_IN_YEARS" or "age". (Admittedly this is less of a problem when you're receiving an extract versus when people are inserting data, but still.)
Any sort of nontrivial query is a complete disaster. To relationalise a three-attribute EAV system so you can query it in a rational fashion requires three joins of the EAV table.

Compare:

SELECT cID.ID AS [ID], cH.Value AS [Height], cW.Value AS [Weight], cA.Value AS [Age]
FROM (SELECT DISTINCT ID FROM Client) cID 
      LEFT OUTER JOIN 
    Client cW ON cID.ID = cW.ID AND cW.Metric = 'Wt_kg' 
      LEFT OUTER JOIN 
    Client cH ON cID.ID = cH.ID AND cH.Metric = 'Ht_cm' 
      LEFT OUTER JOIN 
    Client cA ON cID.ID = cA.ID AND cA.Metric = 'Age_yr'

To:

SELECT c.ID, c.Ht_cm, c.Wt_kg, c.Age_yr
FROM Client c
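The duplicate-data point is easy to see with a small experiment. This is only a sketch: it uses Python's built-in sqlite3 module rather than the SQL Server dialect of the query, the sample clients are invented, and each join filters on its own alias (cW, cH, cA). Client 2 is given two ages, then also a second weight:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- EAV layout: one row per (client, metric); every value is stored as text.
    CREATE TABLE Client (ID INTEGER, Metric TEXT, Value TEXT);
    INSERT INTO Client VALUES
        (1, 'Ht_cm', '180'), (1, 'Wt_kg', '75'), (1, 'Age_yr', '40'),
        (2, 'Ht_cm', '165'), (2, 'Wt_kg', '60'),
        (2, 'Age_yr', '31'), (2, 'Age_yr', '32');  -- two ages for client 2
""")

PIVOT = """
    SELECT cID.ID, cH.Value AS Height, cW.Value AS Weight, cA.Value AS Age
    FROM (SELECT DISTINCT ID FROM Client) cID
    LEFT OUTER JOIN Client cW ON cID.ID = cW.ID AND cW.Metric = 'Wt_kg'
    LEFT OUTER JOIN Client cH ON cID.ID = cH.ID AND cH.Metric = 'Ht_cm'
    LEFT OUTER JOIN Client cA ON cID.ID = cA.ID AND cA.Metric = 'Age_yr'
    ORDER BY cID.ID, Weight, Age
"""

def rows_for(client_id):
    """Pivoted rows for one client."""
    return [r for r in conn.execute(PIVOT) if r[0] == client_id]

one_doubled = rows_for(2)   # one attribute doubled
conn.execute("INSERT INTO Client VALUES (2, 'Wt_kg', '61')")
two_doubled = rows_for(2)   # two attributes doubled

print(len(rows_for(1)), len(one_doubled), len(two_doubled))
print(type(one_doubled[0][1]).__name__)  # heights come back as strings
```

Nothing in the table definition can prevent the second age or weight from being inserted, and the values come back as text, so any numeric comparison or arithmetic needs an explicit cast.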

Here's a (very short) list of when you should use EAV:

When there's absolutely no way around it and you have to support schema-less data in your database.
When you just need to store "stuff" and don't expect to have to need it in a more structured form. Beware, though, the monster called "changing requirements".

I know I just spent this entire post detailing why EAV is a terrible idea in most cases, but there are a few cases where it's needed or unavoidable. However, most of the time (including the example above), it's going to be far more hassle than it's worth. If you have a requirement for wide support of EAV-type data input, you should look at storing it in a key-value system, e.g. Hadoop/HBase, CouchDB, MongoDB, Cassandra, BerkeleyDB.