Mysql – Normalizing/combing mutliple tables with similar data

database-designMySQLmysql-workbenchnormalizationschema

I'm working on setting up a database of vehicle records based on data from a third party vendor. There are 3 models: Vehicles, Transmissions, and Adaptors.

Both Vehicles and Transmissions are 1:n Adaptors. Technically Transmissions are just Vehicles with additional transmission columns, while Vehicles contain fuel system columns in addition to the vehicle columns. I am having trouble determining the best way of combining my data into one normalized set.

Here are some samples of my data:

Transmissions table

**Transmissions table**

Vehicles Table
enter image description here

Pivot Tables

enter image description here

As you can see, Transmissions and Vehicles are basically the same up to the trans_type/fuel_system columns (respectively) and beyond. For example, the Vehicle I have highlighted is the same as the 2nd Transmission shown with engine_code LML.

Ideally, I am hoping to end up with just one table of vehicle data. For instance, if I were to merge the highlighted records, I should end up with only 2 records for a 2013 GMC Sierra 3500 HD:
One with a 6.0L V8 engine and the other with a 6.6L V8 engine, along with the respective fuel system and transmission columns for each.

The pivot tables would also be combined as well which means I need to replace the old IDs with new IDs from the combined data, somehow.

Here are my thoughts on how to approach this:

  1. Normalize the similar columns (makes, models, engine, transmission)
  2. Simply split off the fuel_system and trans_type columns into their own tables (but this does nothing to help me combine, it just makes things a bit more manageable until I can figure out how to combine)
  3. Make a new table containing all the columns from both tables, and insert the data from Vehicles and Transmissions, fill in the blanks (i.e. update fuel_system where records match and vice versa for transmission columns), then begin to clean up duplicates.

I apologize for the lengthy post, but I have yet to find anything in my searches that really outlines what this process may look like. Any advice or suggestions are welcome, and thank you in advance.

Best Answer

Functional dependencies and normalization

In order to carry out a normalization exercise involving second and further normal forms —as per the relational model of data by Dr. E. F Codd—, one first has to know what are the relevant functional dependencies (FDs for brevity) between the attributes (usually depicted as columns) of an adapted mathematical relation (usually portrayed as a table). This kind of exercise belongs to the logical-level of abstraction of a database. That is why, in order to address your stated intention of obtaining a normalized set, I requested information about the applicable FDs via comments.

For instance, a FD involving the hypothetical attributes Foo and Bar can be depicted as Foo → Bar, which may in turn be read as “attribute Foo determines attribute Bar” or “attribute Bar is determined by attribute Foo”. In this way one can distinguish (a) one or more attributes, or one or more combinations of attributes, that are the key, or the keys, of a relation and (b) distinguish the attributes that are not, or are not part of, the key or keys.

Regarding the scenario you describe, let us suppose that the grid entitled Transmissions is a concrete representation of a mathematical relation. With respect to the values of the columns labeled as submodel, body_code, engine_code, engine, desplacement, cylinders, trans_type, trans_code and qualifier, one can say that:

  • Some (engine_code and trans_code) appear to be determined by the values of what would be the key of a relation called, let us say, Vehicle — i.e., the combination of (model, make, year).

  • Some (transmission_type and qualifier) appear to be determined by the values of what would be the key of a relation called, let us say, TransmissionAdaptor — i.e., trans_code.

  • Some (engine_type, displacement and cylinders) appear to be determined by the values of what would be the key of a relation called, let us say, Engine — i.e., engine_code.

But those are simple assumptions, based on my personal interpretation of the information contained in the aforementioned grid, and evidently I am not familiar at all with the business context in question. Therefore, in order to get rid of those kinds of needless and problematic assumptions you have to interview the business experts, and they will help you to identify the FDs, which would in turn guide you in performing proper normalization and laying out the database structure with the precision that the database administration profession requires. If there are no business experts who you can resort to, then you will have to dive into the data sets, observe data usage and meaning, and analyze carefully the interconnections between the pieces of information of interest to determine the significant FDs on your own.

A normalized set

Having a normalized database helps to avoid update/modification anomalies (affecting INSERT, UPDATE and DELETE operations) that arise eventually when there exist undesirable dependencies among the attributes (columns) of the relations (tables) under consideration. In cases in which, e.g., there are attributes of a relation that (i) depend on non-key attributes, or that (ii) depend on parts of a composite —i.e., multi-attribute— key, the designer has to decompose the concerning relation into two or more.

Thus, most of the time, a normalized set is made up of various relations, each of which is meant to contain exactly one particular type of fact in its tuples (rows). Contrarily, unnormalized and non-fully-normalized sets consist of one or more relations that contain more than one type of fact in their tuples.

The design of a database starting from the conceptual level of abstraction

On the other hand, you can design the relevant database starting from a different perspective, analyzing first the structure and associations —or relationshipsa or connections— of the types of things of interest from a purely conceptual point of view, without thinking about relations (tables), attributes (columns), constraints and normalization yet. Of course, this approach also entails close communication between the database designer(s) and the business experts or, in absence of business experts, database designer(s) with in-depth knowledge of the informational needs and characteristics of the business environment.

Sample business rules

So I will put together some hypothetical business rules that will help in the creation of an expository conceptual schema based on mere assumptions about the information contained in your question.

A Vehicle is:

  • primarily identified by exactly one combination of Model, Make and Year
  • moved by exactly one Engine
  • equipped with exactly one TransmissionAdaptor

An Engine:

  • is primarily identified by exactly one EngineCode
  • is catalogued by exactly one EngineType
  • has exactly one Displacement
  • holds exactly one type of Cylinders
  • is provided with exactly one FuelSystem
  • is installed on zero, one or more Vehicles

A TransmissionAdaptor is:

  • primarily identified by exactly one TransmissionCode
  • categorized by exactly one TransmissionType
  • installed on zero, one or more Vehicles

A FuelSystem:

  • is primarily identified by exactly one Name
  • is classified by exactly one FuelSystemType
  • has exactly one Pressure
  • is installed on zero, one or more Vehicles

As demonstrated, (a) the hypothetical association types between the properties of the possible entity types of relevance, and (b) the hypothetical association types between the possible entity types themselves, have been stated in a relatively clear manner.

Clearly, these rules are exclusively a medium to expound a method you could follow to design your database. Being devised as personal interpretations, they of course ought to be confirmed, refuted or adapted in accordance with real business environment features.

Sometimes, a database designer draws a diagram portraying the conceptual schema definitions so as to supply a graphical tool that assists in the communication among all the interested —technical and non-technical— parties.

When the results of this stage are stable to a certain degree, one can start thinking about representing these conceptual aspects by means of logical level constructs more confidently.

a By the way, it is opportune to point out that a conceptual relationship is very different from a logical relation.

Illustrative SQL-DDL logical-level design

Successively, I have created the following four tables that represent the conceptual-level formulations elaborated above:

CREATE TABLE TransmissionAdaptor ( 
    TransmissionAdaptorCode CHAR(10) NOT NULL,
    TransmissionType        CHAR(30) NOT NULL,
    --
    CONSTRAINT TransmissionAdaptor_PK PRIMARY KEY (TransmissionAdaptorCode)
);

CREATE TABLE FuelSystem ( 
    Name           CHAR(6)  NOT NULL,
    FuelSystemType CHAR(10) NOT NULL,
    Pressure       CHAR(10) NOT NULL,
    FuelSystemName CHAR(30) NOT NULL,
    --
    CONSTRAINT FuelSystem_PK PRIMARY KEY (Name)
);

CREATE TABLE MyEngine ( 
    EngineCode     CHAR(3) NOT NULL,
    EngineType     CHAR(8) NOT NULL,
    Displacement   CHAR(3) NOT NULL,
    Cylinders      CHAR(3) NOT NULL,
    FuelSystemName CHAR(6) NOT NULL,
    --
    CONSTRAINT Engine_PK               PRIMARY KEY (EngineCode),
    CONSTRAINT Engine_to_FuelSystem_FK FOREIGN KEY (FuelSystemName)
        REFERENCES FuelSystem (Name)
);

CREATE TABLE Vehicle ( 
    Model                   CHAR(20) NOT NULL,
    Make                    CHAR(10) NOT NULL,
    MyYear                  SMALLINT NOT NULL,
    EngineCode              CHAR(3)  NOT NULL,
    TransmissionAdaptorCode CHAR(10) NOT NULL,
    --
    CONSTRAINT Vehicle_PK                 PRIMARY KEY (Model, Make, MyYear),
    CONSTRAINT Vehicle_to_Engine_FK       FOREIGN KEY (EngineCode)
        REFERENCES MyEngine (EngineCode),
    CONSTRAINT Vehicle_to_Transmission_FK FOREIGN KEY (TransmissionAdaptorCode)
        REFERENCES TransmissionAdaptor (TransmissionAdaptorCode)
);

Since now one would be dealing with logical-level elements, with tables that have clearly declared keys that are based on a clear conceptual schema, a proper normal forms assessment exercise would be called for in order to test the soundness of the design.

Personally, I find that the design sequence described above, i.e.,

  • (1) conceptual formulations ⟷ (2) logical representation ⟷ (3) testing via normal forms

is more natural than directly working with “isolated” functional dependencies, since it aids the designer(s) in understanding the scenario from different levels of abstraction in a thorough way.

Further considerations

System-assigned surrogates

As you can notice, I have not attached an extra Id column (commonly added to contain system-generated surrogate key values) to any of these base tables, since doing so only hinders the (conceptual) modelling and (logical) normalization tasks. Once you have a stable structure with the corresponding constraints, it would be suitable to evaluate the specific cases where the addition of that non-data artifact is beneficial.

Column datatypes

An important factor that was not discussed in detail is the particular domain associated with each column. So, when you continue working on the design of the database, you have to determine what are the most fitting datatypes and sizes of each column.

Data derivation

Having a logical layout like the one above, you have to make use of derived tables (e.g., tables declared by means of SELECT operations that gather columns FROM one or more base or —other— derived tables) to obtain the information as shown, e.g., in your Transmission grid.

Of course, a derived table can be defined as a view that can be queried on further, so as to, e.g., ease writing the code of future data manipulation operations.

It might be useful to study this point to see if you can accomplish your stated objective of, ideally, ending up with just one table of Vehicle data.

Physical-level aspects

The physical level of abstraction is another point that decidedly requires attention, since the configuration of the most convenient indexes is directly associated with the optimal functioning of the database (e.g., reading and writing speed, scalability). Needless to say, you have to take into account the tendencies of the data manipulation operations (or queries) in this respect. Usually, you fix indexes that support columns that are involved, e.g., in WHERE and JOIN clauses.

One of the numerous shortcomings and disadvantages of MySQL as a database management system is that it does not offer built-in support for “indexed” or “materialized” views, so this factor is decidedly worth taking into account if you want to construct a database that works efficiently.

Other physical-level aspects of significance are, e.g., setting up the involved hardware (hard drive, memory, processor, etc.), the database management system, the operating system, the network bandwidth, etc., in an optimal way.


Addendum

Evidence of the importance of possessing full knowledge of the business domain of concern when designing a database arose in a recent comment interaction, in which @Rick James brought up a pertinent consideration about some entity types, their corresponding association types and cardinality ratios in your scenario:

It looks like transmissions:adaptors is many:many, not 1:many. Please clarify.

…to which you responded as follows:

Yes, this is correct. Transmissions have their own adaptors and FuelSystems have their own adaptors. A transmission belongs to zero, one or more transmission adaptor(s) and a transmission adaptor can belong to zero, one or more transmissions.

Consequently, as you know, it is necessary to model those conceptual entity types, i.e.:

  • Transmission, TransmissionAdaptor, FuelSystem and FuelSystemAdaptor,

and the connecting association types, i.e.:

  • Transmission-TransmissionAdaptor and FuelSystem-FuelSystemAdaptor,

so that they reflect your business domain characteristics with the demanded precision.

Once all the related aspects are represented in a logical-level DDL design with the respective tables (puttin up one for each entity/association type), columns, datatypes and constraints, you may like to assess the involved functional dependencies to deliver a strong, normalized, system.