Design for scientific data: a data table with hundreds of columns, or a generic value column and hundreds of rows (EAV)?

database-design, eav, schema

I am trying to improve the data storage in a biomedical science lab where I recently started working. The existing workflow is atrocious, involving many differently formatted Excel sheets that all get aggregated through a process of copy-paste and buggy macros.

My intent is to create a simple Python script that aggregates all of the data for an experiment into an SQLite database and then produces the requisite CSV/XLSX output.

My issue is that for a single trial of our experiment, we end up with about 100 variables recorded at about 10 different time points. My initial impulse was to create a value table and a variable table:

CREATE TABLE value (val_id INTEGER PRIMARY KEY,
                    value TEXT,
                    var_id INTEGER,
                    event_id INTEGER,
                    exp_id INTEGER,
                    FOREIGN KEY (var_id) REFERENCES variable(var_id),
                    FOREIGN KEY (event_id) REFERENCES event(event_id),
                    FOREIGN KEY (exp_id) REFERENCES experiment(exp_id)
);

CREATE TABLE variable (var_id INTEGER PRIMARY KEY,
                       var_name TEXT,
                       var_type TEXT
);

value:
val_id | value  | var_id | ...
0      | 10     | 0      | ...
1      | "ROSC" | 5      | ...

variable:
var_id | var_name | var_type
0      | Pressure | DECIMAL
...    | ...      | ...
5      | Outcome  | TEXT

But this feels wrong, and I have a hunch that the "proper" way to do this is to have a single data table with the hundreds of columns that would otherwise be described in the variable table, since this would make it easier to do type checking (yes, I know SQLite doesn't enforce column types, but in principle).

Any insight into how to tackle this would be hugely appreciated.

Best Answer

What you are describing is the EAV (Entity-Attribute-Value) model, which most database professionals would run a mile from. It's also sarcastically called the OTLT (One True Lookup Table) and is a classic novice mistake. Your hunch is correct!

Here (and here) is the opinion of Joe Celko (a veteran SQL programmer who is/was a member of the SQL standards committee). The fact that he titles his article with the phrase "EAV of destruction" should give you a clue :-). Celko also calls this the Massively Unified Code Key.

It's no coincidence that the acronym is MUCK! :-)

Storing data in this way destroys many of the advantages of relational databases, such as DRI (Declarative Referential Integrity), CHECK constraints and DEFAULT values.
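To see what that means in practice, consider the value table from your question: there is no clean way to tell the database "Pressure must be a non-negative number but Outcome is free text", because every measurement lands in the same TEXT column. A rough sketch of the best you could do (the constraint shown is illustrative, not a recommendation):

CREATE TABLE value (val_id INTEGER PRIMARY KEY,
                    value  TEXT,
                    var_id INTEGER REFERENCES variable(var_id),
                    -- a CHECK here applies to every variable at once;
                    -- per-variable rules such as "Pressure >= 0" would have to be
                    -- hard-coded against var_id values or pushed into application code
                    CHECK (value IS NOT NULL AND value <> '')
);

With one properly typed column per variable, each of those rules becomes a one-line column constraint that the database enforces for you.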

Please do create a table with 100 columns and 10 rows per trial - if that's what your data requires, then do it. A couple of other fields such as an experiment ID, a datetime and an experimenter_id would probably also be useful. That way you can perform analyses over given periods, compute various other aggregates - basically slice and dice all of your data.
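A minimal sketch of that layout (pressure, heart_rate and outcome stand in for your real variables, and event_time for the time point; the remaining ~100 columns follow the same pattern):

CREATE TABLE trial_measurement (
    measurement_id  INTEGER PRIMARY KEY,
    exp_id          INTEGER NOT NULL REFERENCES experiment(exp_id),
    experimenter_id INTEGER,
    event_time      TEXT NOT NULL,                 -- ISO-8601 datetime of the time point
    pressure        DECIMAL CHECK (pressure >= 0), -- one typed column per variable
    heart_rate      INTEGER,
    outcome         TEXT DEFAULT 'unknown'
    -- ... the other measured variables, each with its own type and constraints
);

-- "slice and dice": e.g. average pressure per experiment over a given period
SELECT exp_id, AVG(pressure)
FROM trial_measurement
WHERE event_time BETWEEN '2024-01-01' AND '2024-06-30'
GROUP BY exp_id;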

<Personal opinion> If you haven't chosen your database yet and are happy to use F/LOSS, then may I recommend PostgreSQL - its SQL dialect is by far the richest of the Open Source databases (I have no connection to the project). Check here for how to enforce datatypes in SQLite, but PostgreSQL offers more than SQLite - it is multi-user, for example, and there is no need to jump through hoops to enforce datatypes. </Personal opinion>
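For completeness, the two usual tricks in SQLite are a CHECK constraint on typeof() or, from SQLite 3.37 onwards, a STRICT table; a short sketch (the linked answer may show a different variant):

CREATE TABLE measurement_checked (
    pressure REAL CHECK (pressure IS NULL OR typeof(pressure) = 'real'),  -- rejects e.g. 'abc'
    outcome  TEXT CHECK (outcome  IS NULL OR typeof(outcome)  = 'text')
);

-- or let the engine enforce the declared types directly:
CREATE TABLE measurement_strict (
    pressure REAL,
    outcome  TEXT
) STRICT;   -- requires SQLite 3.37+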

[EDIT1]

Just one further remark, and for completeness: there is only one significant system out there that makes use of the EAV model - and that's Magento (1, 2). Its principal niche is the fashion industry, where the EAV model may be suitable for sparse tables (fashion items tend to be available in a myriad of colours, styles, sizes...). It is popular (1, 2), but then so is MySQL, which is inferior in many ways to PostgreSQL, Firebird and (multi-user capability aside) SQLite.