Background
I'm managing a relatively small database project in which we are adding support for reporting on status updates for items in our product table. I was a bit thrown into this, and I have only about a month of experience writing SQL.
Problem Description
At its core we have a central table [product] with a bigint unique key. We now want to record various messages that come in from a satellite application. The messages come in two major types (MessageA and MessageB) that are almost identical: MessageB contains an extra column that MessageA doesn't possess. Also of note: the two tables draw their message_type values from disjoint sets (there is no overlap between them), and no columns are NULL. That is to say, each message type has its own set of message_type values.
MessageA:
- id
- timestamp
- message_type
- product_id
- floor_id
MessageB:
- id
- timestamp
- message_type
- product_id
- floor_id
- section_number
What I tried
My initial design was to add 2 tables, one for each new message type, exactly mirroring their columns. This "seemed" more "normalized" based on my month or so of SQL experience. But when I started writing a query to combine the data into a report, I couldn't come up with a non-redundant query to build the dataset. My primitive query looked like:
Pseudocode
(
SELECT MessageA.*, NULL AS section_number FROM product
JOIN MessageA ON MessageA.product_id = product.id
WHERE <filtering criteria on product>
)
UNION ALL
(
SELECT MessageB.* FROM product
JOIN MessageB ON MessageB.product_id = product.id
WHERE <identical to first filter>
)
I'm a little paranoid about the long-term performance implications of querying [product] twice, since it's our biggest table (it grows by up to 1M rows a year, maybe more). The DB runs largely unmaintained off-site on consumer-level hardware for an average life-cycle of 3-5 years between upgrades, and we have had some reports trickle in of issues at our largest sites. These two new tables would potentially grow at 3-7 times the rate of [product] (possibly 5 million rows per year or more).
I started to think it might be simpler to just have one table and make section_number nullable: if section_number IS NULL then the row is type A, otherwise it is type B.
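For reference, that single-table design might look something like the following sketch (table and column names here are assumptions based on the descriptions above, and the filter is still pseudocode):

```sql
-- One combined message table; section_number is NULL for type-A rows.
CREATE TABLE message (
    id             bigint      NOT NULL PRIMARY KEY,
    ts             timestamp   NOT NULL,
    message_type   varchar(32) NOT NULL,
    product_id     bigint      NOT NULL REFERENCES product (id),
    floor_id       bigint      NOT NULL,
    section_number int         NULL  -- NULL => type A, NOT NULL => type B
);

-- The report then needs only a single pass over [product]:
SELECT m.*
FROM product AS p
JOIN message AS m ON m.product_id = p.id
WHERE <filtering criteria on product>;
```

With this shape, the report query touches [product] once and there is no UNION to keep in sync.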
The actual question
Is this a good idea?
Should I be worrying about this optimization?
Is this even an optimization or just a more accommodating design?
I'm looking for some guidance whether I should shape the data based on "input" format or "output". Normalization is elegant but at what point should I bend the schema to look like the desired output structure?
Best Answer
There are two ways of approaching the answer to your question:
First: Is pre-optimization a good idea?
As a general rule, don't pre-optimize on the assumption that you will have a problem. Use volume testing to determine if you have a problem and denormalize for optimization purposes if that is the best of your available solutions/compromises.
Second: Is this a good case for denormalization?
Having said that, there is a practical limit to how fussy you want to be about functional dependencies. Are your type A and type B messages really that different? They both seem to quack like a duck, as it were. A single differing attribute, where the difference is merely that it is null for one set of records and not null for the other, isn't necessarily a good reason to implement two distinct message tables.
You might want to have a logical model that makes the distinction between type A and type B messages, but it doesn't necessarily follow that your physical model has to implement these two entity-types as separate tables.
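One common way to keep that logical distinction over a single physical table is a pair of views (a sketch; the view and column names are assumptions):

```sql
-- Type-A messages: rows without a section_number.
CREATE VIEW message_a AS
SELECT id, ts, message_type, product_id, floor_id
FROM message
WHERE section_number IS NULL;

-- Type-B messages: rows with a section_number.
CREATE VIEW message_b AS
SELECT id, ts, message_type, product_id, floor_id, section_number
FROM message
WHERE section_number IS NOT NULL;
```

Code that cares about the distinction queries the views; the report queries the base table directly.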
You have the option of using a constraint to enforce the relationship between message type and section number. You don't have to implement your constraint through normalization.
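For example, assuming the single-table design above and that the type-A and type-B message_type values are known, fixed lists (the values below are placeholders), a table-level CHECK constraint can enforce that section_number is present exactly when the message is type B:

```sql
ALTER TABLE message
ADD CONSTRAINT ck_message_section_number
CHECK (
    (message_type IN ('a_type_1', 'a_type_2') AND section_number IS NULL)
 OR (message_type IN ('b_type_1', 'b_type_2') AND section_number IS NOT NULL)
);
```

If the lists of message_type values change over time, a lookup table with a flag column (joined via a foreign key) is a more maintainable way to express the same rule.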