Survey Database – Respondent Table (1NF or 3NF – Pros and Cons)

database-designerdnormalization

Building out a Survey ERD ODS layer. Now i'm debating on the design for my Respondent Table. Not sure if its best to have it as 1NF or 3NF. FOr a quick back story I get a survey extract with respondent, response, and questions data. This data is loaded as his in a staging area. From there I am moving the data from staging to ODS than to DW. But wanted to get some feedback and pros and cons on my respondent table.

Now here is what i'm thinking for my design for the 1NF respondent table and 3NF respondent table.

- **1NF Respondent Table**
 - RESPONDENT_ID PK
 - RESPONDENT_SERIAL_ID
 - RESPONDENT_AGE_CODE
 - RESPONDENT_AGE_DESC
 - RESPONDENT_GENDER_CODE
 - RESPONDENT_GENDER_DESC
 - RESPONDENT_STATELIVEIN_CODE
 - RESPONDENT_STATELIVEIN_DESC
 - RESPONDENT_REGION_CODE
 - RESPONDENT_REGION_DESC
 - CREATE_DATE
 - UPDATE_DATE
 - CHECK_SUM - MD5
 - CURRENT_IND

And 58 other attributes that define a respondent such as zipcode, political party, employment and etc.

 - **3NF Respondent Table**
 - RESPONDENT_ID PK
 - RESPONDENT_SERIAL_ID
 - RESPONDENT_TYPE -- Example (AGE, STATELIVEIN, ZIPCODE, POTPARTY and
   etc)
 - RESPONDENT_CODE
 - RESPONDENT_DESC
 - CREATE_DATE
 - UPDATE_DATE
 - CHECK_SUM - MD5
 - CURRENT_IND

Now my question what are some of the pros and cons on both design of each table and lastly which would you go with and if there are any improvements you see please let me know.

Best Answer

Your question really is not about normalization. Instead it is about specialization vs. generalization with respect to design. Let me give some background to show why this is the case.

Background

A table is a relational (R-Table) table, and thus normalized (meaning in 1NF by definition) if in its design a discipline is followed that ensures:

  • Distinct, un-ordered rows
  • Uniquely named, un-ordered columns
  • Each column contains a single value from the corresponding domain

Normalization, more formally called Projection-Join Normalization, is a scientific process in which one can remove redundancies in R-tables due specifically to join dependencies which are not implied by the candidate keys. The join dependencies are exploited by taking projections based on them to create two or more tables from the original table which removes the redundancy. It is important to note that normalization cannot remove all redundancy. Instead, it can remove only redundancies caused by join dependencies not being implied by the primary key.

The first 3 normal forms and Boyce-Codd Normal Form (BCNF) deal with redundancies due to functional dependencies not implied by the candidate keys. While most designers talk about normalization to the third normal form, what they really mean is normalization to BCNF as that is the normal form with respect to the functional dependencies.

Functional Dependencies

In order to determine if either table is in any normal form, 1NF or 3NF, (and thus an R-Table), we must clearly state the functional dependencies. A mistake most designers make is to assume the functional dependencies based upon what makes sense to them. In simple examples this isn't usually a problem, but in complex real world scenarios it is deadly. For your example let us assume the following functional dependencies for the first table:

RESPONDENT_ID -> RESPONDENT_SERIAL_ID, RESPONDENT_AGE_CODE, RESPONDENT_GENDER_CODE, RESPONDENT_STATELIVEIN_CODE,
RESPONDENT_REGION_CODE, CREATE_DATE, UPDATE_DATE, CHECK_SUM - MD5,
CURRENT_IND, 58 other attributes that define a respondent such as zip code,
political party, employment and etc.
RESPONDENT_AGE_CODE -> RESPONDENT_AGE_DESC
RESPONDENT_GENDER_CODE -> RESPONDENT_GENDER_DESC
RESPONDENT_STATELIVEIN_CODE -> RESPONDENT_STATELIVEIN_DESC
RESPONDENT_REGION_CODE -> RESPONDENT_REGION_DESC

Aside: One thing I will gloss over in the interest of space but that is extremely important is truly understanding all of the candidate keys. In this example we have RESPONDENT_ID as the PK. Are there other data elements that uniquely identify a respondent? Just designating a surrogate key upon which we declare all the other now "non-key" columns functionally dependent does not change the functional dependencies of those non-key columns on some other column or set of columns which also form a candidate key.

Another Aside: There is a CURRENT_IND column. That makes the PK of RESPONDENT_ID very suspicious. If this means what it usually means, then the table could potentially have multiple rows per RESPONDENT_ID, many of which are not current, and one of which is. Then the real candidate key is something like RESPONDENT_ID plus CREATE_DATE. This makes the example too complex to analyze though so I will ignore it. But in reality this is a real problem that implies you really have two types of things here - current respondents and former respondents. This gets into temporal concerns which are complicated to address.

Let us assume the following functional dependencies for the second table:

RESPONDENT_ID -> RESPONDENT_SERIAL_ID, 58 other attributes that
define a respondent such as zip code, political party, employment and etc.
RESPONDENT_ID, RESPONDENT_TYPE -> RESPONDENT_CODE, CREATE_DATE,
UPDATE_DATE, CHECK_SUM - MD5, CURRENT_IND
RESPONDENT_TYPE, RESPONDENT_CODE -> RESPONDENT_DESC

Evaluating The First Respondent Table

Given the assumed functional dependencies, the first table is indeed in 1NF, and thus an R-Table, as it meets the rules laid out in the background. However, realize that you really must look at the data to really determine if those rules are being met. For example, we may assume that each column contains a single value, but upon inspection of the data find that some column has been populated with multiple values separated by commas or pipes. An even more insidious violation of normalization may be using the order of rows or columns to encode additional meaning, for example the arrival order of survey responses. For this example let us assume none of this to be the case and the table is normalized and thus an R-Table.

The R-Table is also in 2NF as there are no partial key dependencies. The T-Table is clearly not in 3NF, as there are transitive dependencies. Each of the 4 DESC columns functionally dependent on the CODE columns, which in turn functionally depend on the RESPONDENT_ID column.

Evaluating The Second Respondent Table

Given the assume functional dependencies, the second table is not in 3NF. It is not even normalized (in 1NF) and thus not an R-Table. The RESPONDENT_ID is claimed to be the PK but there is also a composite candidate key on the RESPONDENT_ID plus the RESPONDENT_TYPE. This is because there must now be multiple rows per respondent, each with a different type and that type's corresponding code. The 3NF violation with a transitive dependency between the RESPONDENT_ID and RESPONDENT_TYPE functionally determining RESPONDENT_CODE and then RESPONDENT_TYPE and RESPONDENT_CODE functionally determining RESPONDENT_DESC is still present.

Conclusion

Many designers often assume that taking a specific data element, such as a categorical domain for age, and breaking it into two data elements which are a name value pair is performing "normalization." It is not. It is instead choosing a more generalized vs. more specific design. Normalization is a scientific process with clear rules. An R-Table in BCNF clearly has less redundancy that one on 2NF. The BCNF R-Table is clearly better with respect to avoiding redundancy. Deciding between specialization vs. generalization however is not a scientific process. So how to know which is better? Well it depends on the situation. The advantage to generalization is that it is more flexible for future considerations only in that the data structure doesn't have to change when more data points are added, but comes at a cost of more complexity. The advantage to specialization is it is simpler and enables the DBMS to constraint the domain of each specific column, but it does require additional columns to be added when new data points are added. Personally I prefer to do things like Einstein said "as simple as possible but no simpler."

With respect to the two tables you've presented, I prefer the first design (over a corrected second design) but again not because it is more or less normalized. It is simply more clear what is being represented. The only change from a normalization perspective I would make would be to place each CODE --> DESC combination into its on R-Table so as to achieve full normalization. If you want to be sure to avoid the anomalies associated with R-Tables that are less than fully normalized you must ensure full normalization.

References

A good primer on normalization (and on data management fundamentals in general) is Fabian Pascal's Practical Database Foundation Series. Paper number one describes business modeling (where dependencies are determined) and paper number two which describes normalization. Chris Date's Database Design and Relational Theory provides an in-depth look at normalization along with orthogonality with respect to reducing redundancy in logical database designs.