Normalizing nearly identical tables

database-design, normalization

Background

I'm managing a relatively small database project in which we are adding support for reporting on status updates for items in our product table. I was a bit thrown into this, and I've only got about a month of experience writing SQL.

Problem Description

At its core we have a central table [product] with a bigint unique key. We now want to record various messages that come in from a satellite application. The messages come in 2 major types (MessageA and MessageB) that are almost identical; MessageB contains an extra column that MessageA doesn't possess. Also of note: the message_type values of the two kinds never overlap, and no columns are NULL. That is to say, each of MessageA and MessageB has its own set of message_types.

MessageA:

  • id
  • timestamp
  • message_type
  • product_id
  • floor_id

MessageB:

  • id
  • timestamp
  • message_type
  • product_id
  • floor_id
  • section_number

What I tried

My initial design was to add 2 tables, one for each new message type, exactly mirroring the columns above. This "seemed" more "normalized" based on my month or so of SQL experience. But when I started writing a query to combine the data into a report, I couldn't come up with a non-redundant query to build the dataset. My primitive query looked like:

Pseudocode

(
   SELECT MessageA.*, NULL AS section_number  -- pad MessageA so the column lists line up
   FROM product
   JOIN MessageA ON MessageA.product_id = product.id
   WHERE <filtering criteria on product>
)
UNION ALL
(
   SELECT MessageB.*
   FROM product
   JOIN MessageB ON MessageB.product_id = product.id
   WHERE <identical to first filter>
)
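(For reference, the two mirrored tables behind that query look roughly like this. This is a T-SQL-flavored sketch; I've used placeholder types for the non-key columns rather than our exact ones.)

CREATE TABLE MessageA (
    id             bigint     NOT NULL PRIMARY KEY,
    [timestamp]    datetime2  NOT NULL,
    message_type   int        NOT NULL,
    product_id     bigint     NOT NULL REFERENCES product (id),
    floor_id       int        NOT NULL
);

CREATE TABLE MessageB (
    id             bigint     NOT NULL PRIMARY KEY,
    [timestamp]    datetime2  NOT NULL,
    message_type   int        NOT NULL,
    product_id     bigint     NOT NULL REFERENCES product (id),
    floor_id       int        NOT NULL,
    section_number int        NOT NULL
);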

I'm a little paranoid about the long-term performance implications of querying [product] twice, since it's our biggest table (it adds up to 1M rows a year, maybe more). The DB runs largely unmaintained off-site on consumer-level hardware for an average life-cycle of 3-5 years between upgrades, and we have had some reports trickle in of issues at our largest sites. These 2 new tables would potentially grow at 3-7 times the rate of [product] (possibly 5 million rows per year or more).

I started to think it might be simpler to just have 1 table and make section_number nullable: if section_number IS NULL then the row is type A, otherwise it is type B.
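Roughly, the combined table I have in mind would look like the following (again with placeholder types for the non-key columns), and the report would only have to touch [product] once:

CREATE TABLE Message (
    id             bigint     NOT NULL PRIMARY KEY,
    [timestamp]    datetime2  NOT NULL,
    message_type   int        NOT NULL,
    product_id     bigint     NOT NULL REFERENCES product (id),
    floor_id       int        NOT NULL,
    section_number int        NULL   -- NULL = type A, NOT NULL = type B
);

-- The report query then joins [product] a single time:
SELECT Message.*
FROM product
JOIN Message ON Message.product_id = product.id
WHERE <filtering criteria on product>;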

The actual question

Is this a good idea?

Should I be worrying about this optimization?

Is this even an optimization or just a more accommodating design?

I'm looking for some guidance on whether I should shape the data based on the "input" format or the "output" format. Normalization is elegant, but at what point should I bend the schema to look like the desired output structure?

Best Answer

There are two ways of approaching the answer to your question:

First: Is pre-optimization a good idea?

As a general rule, don't pre-optimize on the assumption that you will have a problem. Use volume testing to determine if you have a problem and denormalize for optimization purposes if that is the best of your available solutions/compromises.

Second: Is this a good case for denormalization?

Having said that, there is a practical limit to how fussy you want to be about functional dependencies. Are your type A and type B messages really that different? They both seem to quack like a duck, as it were. A single differing attribute, where the difference is just that it is null for one set of records and not null for the other, isn't necessarily a good reason to implement two distinct message tables.

You might want to have a logical model that makes the distinction between type A and type B messages, but it doesn't necessarily follow that your physical model has to implement these two entity-types as separate tables.

You have the option of using a constraint to enforce the relationship between message type and section number. You don't have to implement your constraint through normalization.
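For example, on a combined message table like the one sketched in the question (called Message here purely for illustration, and with placeholder message_type codes since your actual values aren't listed), a CHECK constraint can tie section_number to the message type:

-- Sketch only: replace the literal message_type values with your real type-A and type-B codes.
ALTER TABLE Message
ADD CONSTRAINT CK_Message_section_number CHECK (
       (message_type IN (1, 2) AND section_number IS NULL)      -- type-A message_types
    OR (message_type IN (3, 4) AND section_number IS NOT NULL)  -- type-B message_types
);

A constraint like this gives you the same guarantee the two-table design would have enforced structurally, without splitting the data.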