Designing Database for Longitudinal Survey Data

database-design

I'm trying to design a relational database based on a bunch of CSV data that I already have. It's mostly for the purposes of organization and querying subsets, and I won't be adding any new records via transactions. My data describes around 60 attributes of businesses, and it is longitudinal as they are surveyed each year (25 years).

My problem is that some attributes will see many duplicates over the life of a business, and others will see very few. Here's a dumbed down example:

id, year, company_name, type, num_employees, total_sales,
056, 2000, papa johns, fast food, 11, 45000
056, 2001, papa johns, fast food, 11, 45557
056, 2002, papa johns, fast food, 14, 50000
056, 2003, papa johns, pizza, 17, 55000
056, 2004, papa johns, pizza, 17, 55456
063, 1998, pops barbershop, barber shops, 3, 15000
063, 1999, fresh cutz, barber shops, 3, 15023
063, 2000, fresh cutz, barber shops, 3, 15000

I'm looking at around 60 million unique ids, most with a good number of years as well. If the database was in a form similar to this it would be massive and full of duplicates.

Any suggestions on a manner of organizing it?

Best Answer

According to Dr. E. F. Codd, the originator of the Relational Model of Data:

Duplicate values are permitted within columns of a relation, but duplicate rows are prohibited.1

Therefore, at the logical level —and taking into account the stated intention to create a relational database—, the column values duplicates presented in your sample table are not harmful.

If the relevant database has to contain large volumes of information, then it consequently must be managed as such; however, there are some aspects that can be optimized conceptually, logically, physically and pragmatically speaking.

Reorganization suggestions

Expository IDEF1X diagram

Having the referred sample table and data as a reference, it appears that the combination of the columns (year, company_name) is sufficient in ensuring row uniqueness (so it should be declared as a KEY, be it PRIMARY or ALTERNATE); therefore, the column labeled as id would be superfluous (assuming that it is a special addendum meant to contain surrogate key2 values, which lack business meaning, and, quite probably, would require handling an extra INDEX at the physical level).

Nevertheless, company_name and (company_?) type seem to be columns that hold data that is physically “heavy” (in terms of bytes, perhaps of type CHAR(n) or VARCHAR(n)), so I consider convenient to make some structural reorganizations, as depicted in the IDEF1X3 diagram contained in Figure 1:

Fig. 1. Company Information IDEF1X Diagram

Sample SQL-DDL logical-level design

Then, in accordance with said IDEF1X diagram, we can derive the following logical-level DDL layout:

-- As these are only illustrative examples, 
-- you should determine which are the most fitting 
-- data types and sizes for all the table columns 
-- depending on your business context characteristics.

-- Also, you should make accurate tests to define the most
-- convenient INDEX strategies.

CREATE TABLE company (
    company_id       INT      NOT NULL,
    name             CHAR(30) NOT NULL,
    created_datetime DATETIME NOT NULL,
    --
    CONSTRAINT company_PK PRIMARY KEY (company_id),
    CONSTRAINT company_AK UNIQUE      (name) -- ALTERNATE KEY.
);

CREATE TABLE year ( -- “Look-up” table.
    year             SMALLINT NOT NULL,
    created_datetime DATETIME NOT NULL,
    --
    CONSTRAINT year_PK PRIMARY KEY (year)
);

CREATE TABLE company_type ( -- “Look-up” table.
    company_type_code CHAR(2)  NOT NULL,
    name              CHAR(30) NOT NULL,
    --
    CONSTRAINT company_type_PK PRIMARY KEY (company_type_code),
    CONSTRAINT company_type_AK UNIQUE      (name) -- ALTERNATE KEY.
);

CREATE TABLE organizational_register (
    company_id          INT           NOT NULL,
    year                SMALLINT      NOT NULL,
    company_type_code   CHAR(2)       NOT NULL,
    number_of_employees INT           NOT NULL,
    total_sales         PERTINENTTYPE NOT NULL
    created_datetime    DATETIME      NOT NULL,
    --
    CONSTRAINT organizational_register_PK                 PRIMARY KEY (company_id, year), -- Composite PRIMARY KEY.
    CONSTRAINT organizational_register_TO_company_FK      FOREIGN KEY (company_id)
        REFERENCES company (company_id),
    CONSTRAINT organizational_register_TO_year_FK         FOREIGN KEY (year)
        REFERENCES year    (year),
    CONSTRAINT organizational_register_TO_company_type_FK FOREIGN KEY (company_type_code)
        REFERENCES company (company_type_code)
);

Company

The company_id column, of type INT (or something similar, depending on the platform of choice), would decidedly be much “lighter” than company.name (helping to, e.g., speed up data retrieval) that is why I added and fixed it as the PRIMARY KEY (PK) of this table.

The name of each company would only be retained once in the company.name column, optimizing disk space usage if that is one of the (physical) concerns. Said column should be established as an ALTERNATE KEY (AK) by means of UNIQUE and NOT NULL constraints, which prevents the duplication of the same name value (thus, protecting data integrity and consistency).

Company Type

The company_type table would carry out a “look-up” role.

Its PK, called company_type_code, being of type CHAR(2) can retain values with business meaning and is far smaller than the column that encloses the full company_type.name, points that enhance data retrieval quickness and maintain readability (quite useful when, e.g., the end-users are interpreting result sets, the data is analyzed with a reporting tool, at the application program code debugging phase, etc.).

This table may hold the data shown below:

 +-—————————————————-+-————————————-+
 | company_type_code | name         |  
 +-—————————————————-+-————————————-+
 | FF                | Fast food    |
 +-------------------+--------------+
 | PI                | Pizza        |
 +-------------------+--------------+
 | BS                | Barber shops |
 +-------------------+--------------+

Since the company_type.name values would be kept exclusively in this table, this rearrangement would as well permit exploiting disk space in a more advantageous manner. This column, being an AK, also demands UNIQUE and NOT NULL constraints.

There might well be some other columns that serve “look-up” usage in your database, so a comparable approach can be followed.

Year

I integrated a year table because I esteem it can be helpful to enforce the exclusive retention of data that is associated with years that have already been INSERTed in this table.

Organizational Register

All the the modifications discussed above will have the more substantial impacts on the processes that pertain to the organizational_register table.

Since, comparatively, it is very likely that this table is the one that will keep the larger quantity of rows, having two FK constraints of reduced size, i.e., company_id and company_type_code, the disk space consumption would be smaller.

Moreover, as the “reduced” FK values would very likely be involved in SELECT operations as search conditions or filters, they would accelerate data retrieval speed at the physical tier.

Optional columns

As you mentioned that the real table has more than 60 columns —which, in my experience, is considerably wide— and also that you will usually be querying subsets of them, it would be as well favourable to identify which columns are optional, and then move them to separate tables, grouping the ones that bear a conceptual-level association. This procedure would naturally yield several tables that are narrower (in the order of columns) and, being of course “lighter” compared with a wide one, behave generally faster with respect to retrieval and/or computing processes (e.g., the calculation of an average) at the physical level.

In this way, let us suppose that company_type_code and number_of_employees are optional columns (generally handled via NULL marks). Hence, the expository IDEF1X Diagram may evolve as shown in Figure 2:

Fig. 2. Company Information IDEF1X Diagram - Example Modifications

And a the tentative logical arrangement would be modified as follows:

CREATE TABLE company (
    company_id       INT      NOT NULL,
    name             CHAR(30) NOT NULL,
    created_datetime DATETIME NOT NULL,
    --
    CONSTRAINT company_PK PRIMARY KEY (company_id),
    CONSTRAINT company_AK UNIQUE      (name) -- ALTERNATE KEY.
);

CREATE TABLE year (
    year             SMALLINT NOT NULL,
    created_datetime DATETIME NOT NULL,
    --
    CONSTRAINT year_PK PRIMARY KEY (year)
);

CREATE TABLE sales (
    company_id       INT           NOT NULL,
    year             SMALLINT      NOT NULL,
    total_sales      PERTINENTTYPE NOT NULL
    column_x         FOO           NOT NULL,
    column_y         BAR           NOT NULL,
    created_datetime DATETIME      NOT NULL,
    --
    CONSTRAINT yearly_sales_PK     PRIMARY KEY (company_id, year), -- Composite PRIMARY KEY.
    CONSTRAINT sales_TO_company_FK FOREIGN KEY (company_id)
        REFERENCES company (company_id),
    CONSTRAINT sales_TO_year_FK    FOREIGN KEY (year)
        REFERENCES year    (year),
);

CREATE TABLE company_type (
    company_type_code CHAR(2)  NOT NULL,
    name              CHAR(30) NOT NULL,
    --
    CONSTRAINT company_type_PK PRIMARY KEY (company_type_code),
    CONSTRAINT company_type_AK UNIQUE      (name) -- ALTERNATE KEY.
);

CREATE TABLE organizational_register (
    company_id          INT      NOT NULL,
    year                SMALLINT NOT NULL,
    company_type_code   CHAR(2)  NOT NULL,
    number_of_employees INT      NOT NULL,
    created_datetime    DATETIME NOT NULL,
    --
    CONSTRAINT organizational_register_PK                 PRIMARY KEY (company_id, year), -- Composite PRIMARY KEY.
    CONSTRAINT organizational_register_TO_company_FK      FOREIGN KEY (company_id)
        REFERENCES company (company_id),
    CONSTRAINT organizational_register_TO_year_FK         FOREIGN KEY (year)
        REFERENCES year    (year),
    CONSTRAINT organizational_register_TO_company_type_FK FOREIGN KEY (company_type_code)
        REFERENCES company (company_type_code)
);

So, organizational_register would be split into two tables:

  • sales, that would hold the mandatory columns, and
  • organizational_register, that will incorporate the optional ones.

Using ACID Transactions

The discussed reorganizations would require multiple INSERTs in order to manipulate and retain all the relevant data; in this respect, all the pertinent operations might require to be executed within the same ACID TRANSACTION, so that they either succeed or fail as a single Unit of Work. Perhaps, some kind of batch process would be worth investigating and (or) testing.

JOINs and views

Then in order to derive all the columns of the table as presented in your question, you would have to create a SELECT statement that JOINs the necessary tables.

Certainly, defining one or more views based on the queries that include the JOINs discussed above would be very practical in specific cases, as you can SELECT information directly from the pertinent view with ease.

Summary

In general, my recommendations are:

  • setting up PK constraints on columns with data types and sizes that “behave” (physically) faster;
  • retaining the business meaning of (sometimes multicolumn) PKs and FKs when possible;
  • identifying what are the columns that can be moved to new tables, and place them accordingly based on conceptual-level associations; and
  • defining (physical) indexing strategies based on the retrieval tendencies that apply.

Endnotes

1 Codd, E. F. (Jan. 1990). Preface. In The Relational Model for Database Management: Version 2 (p. viii). Boston, MA, USA: Addison-Wesley.

2 E.g., a column with the IDENTITY property in Microsoft SQL Server, or with the AUTO_INCREMENT attribute in MySQL, etc.

3 Integration Definition for Information Modeling (IDEF1X) is a highly recommendable data modeling technique that was established as a standard in December 1993 by the United States National Institute of Standards and Technology (NIST). It is solidly based on (a) the early theoretical work authored by Dr. E. F. Codd; on (b) the Entity-Relationship view, developed by Dr. P. P. Chen; and also on (c) the Logical Database Design Technique, created by Robert G. Brown.

Related Question