Mariadb – Reasons against having “concrete” values in own table with only an ID column and the value itself, and composing bigger objects of IDs

database-designmariadb

This is a follow up question to How to insert values into a highly normalized database without excessive duplicate checking?

The current situation:
We have an "massive" object that has around 30 columns worth of details (while it could be slightly improved, a lot of it sadly does really belong together, and can't easily be semantically separated). Every column simply contains an Integer, an ID key.
Now we have 30 more tables, each only have 2 columns: "id" and "value", where ID is an auto generated key, and value is the actual concrete value.

Whenever we want to get back data from our massive objects we need to do 30 joins from these "value" tables onto the main table.

Reasons given were "that a lot of data is going to be duplicated over the next 2-3 years" (it'd be a lucky stroke if one value gets used more than thrice, and at most 20 times imho, but that's besides the point), and any future changes to those values would be much easier then. Also saving storage space (which might be true, but for perhaps at most 200k records over 2 years nothing that should be noteworthy).

Now I know intuitively a few reasons why this design is bad. But I'm still a novice myself, and I would like proper detailed reasons that I could actually present to change a mind, and perhaps learn something new myself.

Friendly reminder that the database commands we're talking about here are mostly very basic Insert, Select and Update (and very rarely joins) methods. No one in our department is experienced enough in SQL (or willing enough to learn except me) to go above those commands, and almost all logic is done server-side after getting data from the database.

Best Answer

OK, I won't rant about what is 'wrong' in your schema. I will provide an efficient way to batch the normalization for a batch insert.

I will point out two common flaws when "over-normalizing":

  • Don't do it if the id is bigger, on average, than the data. Note: INT is 4 bytes; BIGINT is 8.
  • Don't normalize "continuous" values -- anything numeric or date-related. It makes it very costly to do a range test.

OTOH, 100K rows/year is rather small. If they come in evenly it is a rate of about 5 inserts (plus 30 normalizations) per hour.

If they come in batches, see http://mysql.rjweb.org/doc.php/staging_table#normalization

One thing of note: Things like INSERT IGNORE will "burn" ids. That is, an AUTO_INCREMENT id will be bumped even if it is not needed. The code in that link avoids such.