I'm trying to design a relational database based on a bunch of CSV data that I already have. It's mostly for the purposes of organization and querying subsets, and I won't be adding any new records via transactions. My data describes around 60 attributes of businesses, and it is longitudinal as they are surveyed each year (25 years).
My problem is that some attributes will see many duplicates over the life of a business, and others will see very few. Here's a dumbed down example:
id, year, company_name, type, num_employees, total_sales,
056, 2000, papa johns, fast food, 11, 45000
056, 2001, papa johns, fast food, 11, 45557
056, 2002, papa johns, fast food, 14, 50000
056, 2003, papa johns, pizza, 17, 55000
056, 2004, papa johns, pizza, 17, 55456
063, 1998, pops barbershop, barber shops, 3, 15000
063, 1999, fresh cutz, barber shops, 3, 15023
063, 2000, fresh cutz, barber shops, 3, 15000
I'm looking at around 60 million unique ids, most with a good number of years as well. If the database was in a form similar to this it would be massive and full of duplicates.
Any suggestions on a manner of organizing it?
Best Answer
According to Dr. E. F. Codd, the originator of the Relational Model of Data:
Therefore, at the logical level —and taking into account the stated intention to create a relational database—, the column values duplicates presented in your sample table are not harmful.
If the relevant database has to contain large volumes of information, then it consequently must be managed as such; however, there are some aspects that can be optimized conceptually, logically, physically and pragmatically speaking.
Reorganization suggestions
Expository IDEF1X diagram
Having the referred sample table and data as a reference, it appears that the combination of the columns
(year, company_name)
is sufficient in ensuring row uniqueness (so it should be declared as a KEY, be it PRIMARY or ALTERNATE); therefore, the column labeled asid
would be superfluous (assuming that it is a special addendum meant to contain surrogate key2 values, which lack business meaning, and, quite probably, would require handling an extra INDEX at the physical level).Nevertheless,
company_name
and (company_
?)type
seem to be columns that hold data that is physically “heavy” (in terms of bytes, perhaps of typeCHAR(n)
orVARCHAR(n)
), so I consider convenient to make some structural reorganizations, as depicted in the IDEF1X3 diagram contained in Figure 1:Sample SQL-DDL logical-level design
Then, in accordance with said IDEF1X diagram, we can derive the following logical-level DDL layout:
Company
The
company_id
column, of typeINT
(or something similar, depending on the platform of choice), would decidedly be much “lighter” thancompany.name
(helping to, e.g., speed up data retrieval) that is why I added and fixed it as the PRIMARY KEY (PK) of this table.The name of each company would only be retained once in the
company.name
column, optimizing disk space usage if that is one of the (physical) concerns. Said column should be established as an ALTERNATE KEY (AK) by means of UNIQUE and NOT NULL constraints, which prevents the duplication of the same name value (thus, protecting data integrity and consistency).Company Type
The
company_type
table would carry out a “look-up” role.Its PK, called
company_type_code
, being of typeCHAR(2)
can retain values with business meaning and is far smaller than the column that encloses the fullcompany_type.name
, points that enhance data retrieval quickness and maintain readability (quite useful when, e.g., the end-users are interpreting result sets, the data is analyzed with a reporting tool, at the application program code debugging phase, etc.).This table may hold the data shown below:
Since the
company_type.name
values would be kept exclusively in this table, this rearrangement would as well permit exploiting disk space in a more advantageous manner. This column, being an AK, also demands UNIQUE and NOT NULL constraints.There might well be some other columns that serve “look-up” usage in your database, so a comparable approach can be followed.
Year
I integrated a
year
table because I esteem it can be helpful to enforce the exclusive retention of data that is associated with years that have already been INSERTed in this table.Organizational Register
All the the modifications discussed above will have the more substantial impacts on the processes that pertain to the
organizational_register
table.Since, comparatively, it is very likely that this table is the one that will keep the larger quantity of rows, having two FK constraints of reduced size, i.e.,
company_id
andcompany_type_code
, the disk space consumption would be smaller.Moreover, as the “reduced” FK values would very likely be involved in SELECT operations as search conditions or filters, they would accelerate data retrieval speed at the physical tier.
Optional columns
As you mentioned that the real table has more than 60 columns —which, in my experience, is considerably wide— and also that you will usually be querying subsets of them, it would be as well favourable to identify which columns are optional, and then move them to separate tables, grouping the ones that bear a conceptual-level association. This procedure would naturally yield several tables that are narrower (in the order of columns) and, being of course “lighter” compared with a wide one, behave generally faster with respect to retrieval and/or computing processes (e.g., the calculation of an average) at the physical level.
In this way, let us suppose that
company_type_code
andnumber_of_employees
are optional columns (generally handled via NULL marks). Hence, the expository IDEF1X Diagram may evolve as shown in Figure 2:And a the tentative logical arrangement would be modified as follows:
So,
organizational_register
would be split into two tables:sales
, that would hold the mandatory columns, andorganizational_register
, that will incorporate the optional ones.Using ACID Transactions
The discussed reorganizations would require multiple INSERTs in order to manipulate and retain all the relevant data; in this respect, all the pertinent operations might require to be executed within the same ACID TRANSACTION, so that they either succeed or fail as a single Unit of Work. Perhaps, some kind of batch process would be worth investigating and (or) testing.
JOINs and views
Then in order to derive all the columns of the table as presented in your question, you would have to create a SELECT statement that JOINs the necessary tables.
Certainly, defining one or more views based on the queries that include the JOINs discussed above would be very practical in specific cases, as you can SELECT information directly from the pertinent view with ease.
Summary
In general, my recommendations are:
Endnotes
1 Codd, E. F. (Jan. 1990). Preface. In The Relational Model for Database Management: Version 2 (p. viii). Boston, MA, USA: Addison-Wesley.
2 E.g., a column with the IDENTITY property in Microsoft SQL Server, or with the AUTO_INCREMENT attribute in MySQL, etc.
3 Integration Definition for Information Modeling (IDEF1X) is a highly recommendable data modeling technique that was established as a standard in December 1993 by the United States National Institute of Standards and Technology (NIST). It is solidly based on (a) the early theoretical work authored by Dr. E. F. Codd; on (b) the Entity-Relationship view, developed by Dr. P. P. Chen; and also on (c) the Logical Database Design Technique, created by Robert G. Brown.