Normalizing nearly identical tables

database-design, normalization

Background

I'm managing a relatively small database project in which we are adding support for reporting on status updates for items in our product table. I was a bit thrown into this, and I've only got about a month of experience writing SQL.

Problem Description

At its core we have a central table [product] with a bigint unique key. We now want to record various messages that come in from a satellite application. The messages come in 2 major types (MessageA and MessageB) that are almost identical; MessageB contains an extra column that MessageA doesn't possess. Also of note: the message_type values of the two kinds never overlap, and no columns are NULL. That is to say, each of MessageA and MessageB has its own set of message_types.

MessageA:

  • id
  • timestamp
  • message_type
  • product_id
  • floor_id

MessageB:

  • id
  • timestamp
  • message_type
  • product_id
  • floor_id
  • section_number

What I tried

My initial design was to add 2 tables, one for each new message type, exactly mirroring the columns above. This "seemed" more "normalized" based on my month or so of SQL experience. But when I started writing a query to combine the data into a report, I couldn't come up with a non-redundant query to build the dataset. My primitive query looked like:

Pseudocode

(
   SELECT MessageA.*, NULL AS section_number  -- pad MessageA so the column lists line up
   FROM product
   JOIN MessageA ON MessageA.product_id = product.id
   WHERE <filtering criteria on product>
)
UNION ALL
(
   SELECT MessageB.*
   FROM product
   JOIN MessageB ON MessageB.product_id = product.id
   WHERE <identical to first filter>
)
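(For reference, the two mirrored tables behind that query look roughly like this. This is a T-SQL-flavored sketch; I've used placeholder types for the non-key columns rather than our exact ones.)

CREATE TABLE MessageA (
    id             bigint     NOT NULL PRIMARY KEY,
    [timestamp]    datetime2  NOT NULL,
    message_type   int        NOT NULL,
    product_id     bigint     NOT NULL REFERENCES product (id),
    floor_id       int        NOT NULL
);

CREATE TABLE MessageB (
    id             bigint     NOT NULL PRIMARY KEY,
    [timestamp]    datetime2  NOT NULL,
    message_type   int        NOT NULL,
    product_id     bigint     NOT NULL REFERENCES product (id),
    floor_id       int        NOT NULL,
    section_number int        NOT NULL
);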

I'm a little paranoid about the long-term performance implications of querying [product] twice, since it's our biggest table (it adds up to 1M rows a year, maybe more). The DB runs largely unmaintained off-site on consumer-level hardware for an average life-cycle of 3-5 years between upgrades, and we have had some reports trickle in of issues at our largest sites. These 2 new tables would potentially grow at 3-7 times the rate of [product] (possibly 5 million rows per year or more).

I started to think it might be simpler to just have 1 table and make section_number nullable: if section_number IS NULL then the row is type A, otherwise it is type B.
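Roughly, the combined table I have in mind would look like the following (again with placeholder types for the non-key columns), and the report would only have to touch [product] once:

CREATE TABLE Message (
    id             bigint     NOT NULL PRIMARY KEY,
    [timestamp]    datetime2  NOT NULL,
    message_type   int        NOT NULL,
    product_id     bigint     NOT NULL REFERENCES product (id),
    floor_id       int        NOT NULL,
    section_number int        NULL   -- NULL = type A, NOT NULL = type B
);

-- The report query then joins [product] a single time:
SELECT Message.*
FROM product
JOIN Message ON Message.product_id = product.id
WHERE <filtering criteria on product>;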

The actual question

Is this a good idea?

Should I be worrying about this optimization?

Is this even an optimization or just a more accommodating design?

I'm looking for some guidance on whether I should shape the data based on the "input" format or the "output" format. Normalization is elegant, but at what point should I bend the schema to look like the desired output structure?

Best Answer

There are two ways of approaching the answer to your question:

First: Is pre-optimization a good idea?

As a general rule, don't pre-optimize on the assumption that you will have a problem. Use volume testing to determine if you have a problem and denormalize for optimization purposes if that is the best of your available solutions/compromises.

Second: Is this a good case for denormalization?

Having said that, there is a practical limit to how fussy you want to be about functional dependencies. Are your type A and type B messages really that different? They both seem to quack like a duck, as it were. A single differing attribute, where the difference is just that it is null for one set of records and not null for the other, isn't necessarily a good reason to implement two distinct message tables.

You might want to have a logical model that makes the distinction between type A and type B messages, but it doesn't necessarily follow that your physical model has to implement these two entity-types as separate tables.

You have the option of using a constraint to enforce the relationship between message type and section number. You don't have to implement your constraint through normalization.
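For example, on a combined message table like the one sketched in the question (called Message here purely for illustration, and with placeholder message_type codes since your actual values aren't listed), a CHECK constraint can tie section_number to the message type:

-- Sketch only: replace the literal message_type values with your real type-A and type-B codes.
ALTER TABLE Message
ADD CONSTRAINT CK_Message_section_number CHECK (
       (message_type IN (1, 2) AND section_number IS NULL)      -- type-A message_types
    OR (message_type IN (3, 4) AND section_number IS NOT NULL)  -- type-B message_types
);

A constraint like this gives you the same guarantee the two-table design would have enforced structurally, without splitting the data.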