Table design: paired values of genetic data

database-design

I'm trying to decide the appropriate way to design my tables to store my genetic data. Basically, I have samples, and each sample has multiple loci (a locus is basically a point on the genome), and each locus has a pair of values representing different alleles, or variants (basically, one allele is obtained from the father, and one from the mother).

There are potentially dozens or even hundreds of loci, depending on the project. The two alleles may or may not have the same value, and there is no meaning in the order they appear.

This is what the raw data might look like, and is also typical of what some programs expect for import:
enter image description here

On the plus side, it's still relatively easy to visually interpret and manually enter or change individual values without some other tool. On the negative side, there is no relationship between allele1 and allele2, they just get entered in a random order. This can make querying the data somewhat more cumbersome in some situations, and ends up involving a union. What really bothered me, though, was that it just doesn't seem to meet normalization rules with having alleles spread between multiple columns. It'd be like having pet1, pet2, pet3… as columns in a table listing owners and their pets.

So, I came up with this, which is what I'm currently using:
enter image description here

I like it more because it seems like it's properly normalized, but the downside is that it is visually harder to work with. I don't mind this, because although I'm a biologist, I have a relatively strong CS background and am used to thinking about data abstractly. But my colleagues do not have that background, making it harder for them than it is for me.

Ultimately, I can write scripts to convert the raw data to the proper format for import, as well as converting data exported from the table for use in other programs, but this all depends on finalizing the table design. So, have I properly designed the table for this data? Or am I mistaken about it being normalized? Is there some other alternative I'm missing?

Thank you.

Best Answer

Your first attempt makes by far the most sense - (the one with 4 fields).

enter image description here

As a rule, tables should be "tall and slim" rather than "short and fat"

It means that you can easily compare different alleles (for the same locus) belonging to the same individual - which is MUCH more difficult with your second schema.

The second schema means that you will have to use windowing/analytic functions far more frequently - and if you're using MySQL (which is very common in biology unfortunately) you won't have that facility.

Speaking as someone with both genetic and computer science degrees, I'd go with the first schema - I've worked with similar data (FragileX) - comparing different alleles - it was important for us to distinguish the father and the mother (we were doing multi-generational pedigrees).

I would recommend you include (arbitrary) - allele1 as the father's and allele2 as the mothers as part of your design - you never know when this knowledge might come in useful. Maybe you could add another field (isParentKnown) so that you can distinguish between those samples for which the parents are known from the ones for which they are not.