Table design: paired values of genetic data

database-design

I'm trying to decide the appropriate way to design my tables to store my genetic data. Basically, I have samples, and each sample has multiple loci (a locus is basically a point on the genome), and each locus has a pair of values representing different alleles, or variants (basically, one allele is obtained from the father, and one from the mother).

There are potentially dozens or even hundreds of loci, depending on the project. The two alleles may or may not have the same value, and there is no meaning in the order they appear.

This is what the raw data might look like, and is also typical of what some programs expect for import:

On the plus side, it's still relatively easy to visually interpret and manually enter or change individual values without some other tool. On the negative side, there is no relationship between allele1 and allele2; they just get entered in a random order. This can make querying the data somewhat more cumbersome in some situations, and ends up involving a union. What really bothered me, though, was that it just doesn't seem to meet normalization rules, with alleles spread across multiple columns. It'd be like having pet1, pet2, pet3… as columns in a table listing owners and their pets.
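To make the union point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name genotype_wide, the column names sample and locus, and the sample data are all assumptions made up for illustration; only allele1 and allele2 come from the description above. It counts how often each allele occurs per locus, which requires stacking both allele columns first.

```python
import sqlite3

# Hypothetical wide layout: one row per sample and locus, two allele columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE genotype_wide ("
    " sample TEXT, locus TEXT, allele1 INTEGER, allele2 INTEGER,"
    " PRIMARY KEY (sample, locus))"
)
conn.executemany(
    "INSERT INTO genotype_wide VALUES (?, ?, ?, ?)",
    [
        ("S1", "locA", 120, 124),  # made-up example values
        ("S1", "locB", 88, 88),
        ("S2", "locA", 124, 118),
    ],
)

# An allele can sit in either column, so both columns are stacked with
# UNION ALL before counting occurrences per locus.
allele_counts = conn.execute(
    """
    SELECT locus, allele, COUNT(*) AS n
    FROM (
        SELECT locus, allele1 AS allele FROM genotype_wide
        UNION ALL
        SELECT locus, allele2 AS allele FROM genotype_wide
    ) AS stacked
    GROUP BY locus, allele
    ORDER BY locus, allele
    """
).fetchall()

for row in allele_counts:
    print(row)
```

UNION ALL (rather than UNION) is used so that a homozygous sample contributes two copies of its allele to the count.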

So, I came up with this, which is what I'm currently using:

I like it more because it seems like it's properly normalized, but the downside is that it is visually harder to work with. I don't mind this, because although I'm a biologist, I have a relatively strong CS background and am used to thinking about data abstractly. But my colleagues do not have that background, making it harder for them than it is for me.

Ultimately, I can write scripts to convert the raw data to the proper format for import, as well as converting data exported from the table for use in other programs, but this all depends on finalizing the table design. So, have I properly designed the table for this data? Or am I mistaken about it being normalized? Is there some other alternative I'm missing?
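As a rough idea of what such conversion scripts could look like, here is a small sketch in plain Python. It assumes the two layouts are (sample, locus, allele1, allele2) and (sample, locus, allele); the function names and the example values are hypothetical. Because the order of the two alleles carries no meaning, the conversion back to the wide layout simply sorts them.

```python
def wide_to_tall(rows):
    """Turn (sample, locus, allele1, allele2) rows into (sample, locus, allele) rows."""
    tall = []
    for sample, locus, allele1, allele2 in rows:
        tall.append((sample, locus, allele1))
        tall.append((sample, locus, allele2))
    return tall


def tall_to_wide(rows):
    """Turn (sample, locus, allele) rows back into (sample, locus, allele1, allele2) rows."""
    grouped = {}
    for sample, locus, allele in rows:
        grouped.setdefault((sample, locus), []).append(allele)

    wide = []
    for (sample, locus), alleles in grouped.items():
        if len(alleles) != 2:
            raise ValueError(f"expected 2 alleles for {sample}/{locus}, got {len(alleles)}")
        # Allele order has no meaning, so sort for a stable output.
        allele1, allele2 = sorted(alleles)
        wide.append((sample, locus, allele1, allele2))
    return wide


wide_rows = [("S1", "locA", 124, 120), ("S1", "locB", 88, 88)]  # made-up values
tall_rows = wide_to_tall(wide_rows)
round_trip = tall_to_wide(tall_rows)
print(tall_rows)
print(round_trip)
```

Note that a homozygous locus produces two identical rows in the tall layout, so a tall table would need something beyond (sample, locus, allele) to serve as its key.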

Thank you.

Best Answer

Your first attempt (the one with 4 fields) makes by far the most sense.

As a rule, tables should be "tall and slim" rather than "short and fat".

The first schema means that you can easily compare different alleles (for the same locus) belonging to the same individual, which is MUCH more difficult with your second schema.

The second schema means that you will have to use windowing/analytic functions far more frequently, and if you're using a MySQL version older than 8.0 (MySQL is, unfortunately, very common in biology), you won't have that facility.
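A small sketch of the difference, using Python's built-in sqlite3 module as a stand-in database. The table names, column names, and data are assumptions for illustration. The task is to find heterozygous loci (the two alleles differ). For the one-allele-per-row layout this sketch uses GROUP BY with COUNT(DISTINCT ...) rather than a window function, which is just one possible formulation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical 4-field layout: both alleles on one row.
    CREATE TABLE genotype_wide (sample TEXT, locus TEXT, allele1 INTEGER, allele2 INTEGER);
    -- Hypothetical one-allele-per-row layout.
    CREATE TABLE genotype_tall (sample TEXT, locus TEXT, allele INTEGER);

    INSERT INTO genotype_wide VALUES ('S1', 'locA', 120, 124), ('S1', 'locB', 88, 88);
    INSERT INTO genotype_tall VALUES
        ('S1', 'locA', 120), ('S1', 'locA', 124),
        ('S1', 'locB', 88),  ('S1', 'locB', 88);
    """
)

# 4-field layout: the comparison is a plain row-level predicate.
het_wide = conn.execute(
    "SELECT sample, locus FROM genotype_wide"
    " WHERE allele1 <> allele2 ORDER BY sample, locus"
).fetchall()

# One-allele-per-row layout: the two alleles live on different rows,
# so they have to be brought together by grouping before comparing.
het_tall = conn.execute(
    "SELECT sample, locus FROM genotype_tall"
    " GROUP BY sample, locus HAVING COUNT(DISTINCT allele) > 1"
    " ORDER BY sample, locus"
).fetchall()

print(het_wide, het_tall)
```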

Speaking as someone with both genetics and computer science degrees, I'd go with the first schema. I've worked with similar data (Fragile X) comparing different alleles, and it was important for us to distinguish the father's allele from the mother's (we were doing multi-generational pedigrees).

I would recommend that you adopt an (arbitrary) convention as part of your design: allele1 is the father's allele and allele2 is the mother's. You never know when this knowledge might come in useful. Maybe you could add another field (isParentKnown) so that you can distinguish the samples for which the parents are known from those for which they are not.
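One way this convention might look in practice, sketched with Python's built-in sqlite3 module. Only allele1, allele2, and isParentKnown come from the suggestion above; the table name, the other columns, the 0/1 encoding of the flag, and the data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE genotype (
        sample        TEXT    NOT NULL,
        locus         TEXT    NOT NULL,
        allele1       INTEGER NOT NULL,  -- father's allele when isParentKnown = 1
        allele2       INTEGER NOT NULL,  -- mother's allele when isParentKnown = 1
        isParentKnown INTEGER NOT NULL DEFAULT 0 CHECK (isParentKnown IN (0, 1)),
        PRIMARY KEY (sample, locus)
    )
    """
)
conn.executemany(
    "INSERT INTO genotype VALUES (?, ?, ?, ?, ?)",
    [
        ("S1", "locA", 120, 124, 1),  # parental origin known (made-up values)
        ("S2", "locA", 124, 118, 0),  # parental origin unknown: order is arbitrary
    ],
)

# Paternal alleles can only be read off rows where the origin is known.
paternal_alleles = conn.execute(
    "SELECT sample, locus, allele1 FROM genotype"
    " WHERE isParentKnown = 1 ORDER BY sample, locus"
).fetchall()
print(paternal_alleles)
```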

Related Solutions

Database design advice

                                                  +------------ --- ---+
                                                  | Assembly options   |
+------------+------------+----------+------------+---+---+---+ --- +--+
| assembly ▼ | unit cost  | quantity | total cost | 1 | 2 | 3 |     |50|
+------------+------------+----------+------------+---+---+---+ --- +--+
| VSD55      | £10'000    | 2        | £20'000    | 1 | 1 |   |     |  |

If somebody handed that quote to me, my first question would be "What's option 1 for the VSD55?" The answer would be "I don't know." That information isn't on the quote. In the unlikely event that person got to field a second question, that question would be "What does it cost?" Again, the answer would be "I don't know." A very disturbing silence would follow immediately, during which the person who handed me the quote would imagine how much better it might feel to be run over by a train.

Options must be line items on the quote, along with their unit price, quantity, and total price. Options must be named, not numbered. They should appear directly under their parent assembly, too, not scattered all over hell and half of Georgia.

If you want a shot at my money, you'd better make it crystal clear what I'm supposed to be getting for my money.

There's nothing (much) wrong with 50 check boxes on a user-interface form. That makes it easy to pick options. But the UI code should read the checkboxes and insert the right information into normalized tables.
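A minimal sketch of that idea, using Python's built-in sqlite3 module. Everything here is hypothetical: the quote_line table, the option catalogue with its names and prices, the function name, and the assumption that each option is quoted at the same quantity as its parent assembly. Only the VSD55 line (unit cost 10'000, quantity 2, options 1 and 2 ticked) is taken from the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE quote_line (
        line_id        INTEGER PRIMARY KEY,
        parent_line_id INTEGER REFERENCES quote_line(line_id),  -- NULL for an assembly
        item_name      TEXT    NOT NULL,
        unit_cost      INTEGER NOT NULL,
        quantity       INTEGER NOT NULL
    )
    """
)

# Hypothetical catalogue mapping checkbox numbers to named, priced options.
OPTION_CATALOGUE = {
    1: ("Hypothetical option A", 500),
    2: ("Hypothetical option B", 250),
    3: ("Hypothetical option C", 100),
}


def save_assembly_with_options(conn, assembly, unit_cost, quantity, checked_boxes):
    """Insert the assembly as a line item, then one child line per ticked option."""
    insert = (
        "INSERT INTO quote_line (parent_line_id, item_name, unit_cost, quantity)"
        " VALUES (?, ?, ?, ?)"
    )
    parent_id = conn.execute(insert, (None, assembly, unit_cost, quantity)).lastrowid
    for number in sorted(checked_boxes):
        name, cost = OPTION_CATALOGUE[number]
        # Assumption: an option is quoted at the same quantity as its assembly.
        conn.execute(insert, (parent_id, name, cost, quantity))
    return parent_id


save_assembly_with_options(conn, "VSD55", 10000, 2, checked_boxes={1, 2})

quote_lines = conn.execute(
    "SELECT item_name, unit_cost, quantity, unit_cost * quantity AS total_cost"
    " FROM quote_line ORDER BY line_id"
).fetchall()
for line in quote_lines:
    print(line)
```

The form can keep its 50 check boxes; the stored quote ends up with named options as line items directly under their parent assembly.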

MySQL – To normalize or not

Answering strictly in terms of MySQL, you do not want to overnormalize.

Here is why:

REASON #1 : JOIN BUFFER

There is an in-memory buffer called the join buffer. Its size is set per DB connection by the join_buffer_size option. The more normalized the data are, the more JOIN clauses are needed. In turn, the more JOIN clauses, the more join buffers have to be allocated. If a join buffer is too small, it gets migrated to disk. This slows down the query's execution by

- pausing the query to write the join buffer to disk
- generating more disk I/O while executing the join

REASON #2 : QUERY EVALUATION

Because MySQL executes SQL through external storage engines, query evaluation for WHERE clauses and JOIN clauses is virtually identical. I wrote about this back on March 11, 2013 (Is there an execution difference between a JOIN condition and a WHERE condition?). The algorithm for JOINs is also in the MySQL Documentation. The JOIN algorithm can get rather wonky with multiple JOIN clauses.

REASON #3 : STORAGE ENGINE PERFORMANCE

Any storage engine in MySQL is basically comprised of 14 operations. Those operations, including JOINs, are layered in such a way that performance issues can occur in two layers:

- Query Evaluation (see Reason #2)
- Storage Engine
  - If you JOIN using large columns
  - If you use the wrong Storage Engine
  - If you mix Storage Engines in the JOIN

EPILOGUE

It is strictly up to you how deep you want to normalize your data. If you must normalize and do multiple JOINs, please join on integer values only. Joining on character values will aggravate all three of the issues I just mentioned.
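A sketch of what joining on integer values only could look like, using Python's built-in sqlite3 module as a stand-in for MySQL. The tables, columns, and data are assumptions for illustration: names are stored once in small lookup tables, and the main table references them through integer keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical lookup tables: character values are stored once,
    -- and every other table refers to them by integer key.
    CREATE TABLE sample (sample_id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE locus  (locus_id  INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);

    CREATE TABLE genotype (
        sample_id INTEGER NOT NULL REFERENCES sample(sample_id),
        locus_id  INTEGER NOT NULL REFERENCES locus(locus_id),
        allele1   INTEGER NOT NULL,
        allele2   INTEGER NOT NULL,
        PRIMARY KEY (sample_id, locus_id)
    );

    INSERT INTO sample   VALUES (1, 'S1');
    INSERT INTO locus    VALUES (1, 'locA');
    INSERT INTO genotype VALUES (1, 1, 120, 124);
    """
)

# Both JOIN conditions compare integer columns; the text names are only
# looked up for display.
readable_rows = conn.execute(
    """
    SELECT s.name, l.name, g.allele1, g.allele2
    FROM genotype AS g
    JOIN sample AS s ON s.sample_id = g.sample_id
    JOIN locus  AS l ON l.locus_id  = g.locus_id
    """
).fetchall()
print(readable_rows)
```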