You can do that in pure SQL. Create a partial unique index in addition to the one you have:
CREATE UNIQUE INDEX ab_c_null_idx ON my_table (id_A, id_B) WHERE id_C IS NULL;
This way you can enter for (id_A, id_B, id_C)
in your table:
(1, 2, 1)
(1, 2, 2)
(1, 2, NULL)
But none of these a second time.
Or use two partial UNIQUE
indexes and no complete index (or constraint). The best solution depends on the details of your requirements. Compare:
While this is elegant and efficient for a single nullable column in the UNIQUE
index, it gets out of hand quickly for more than one. Discussing this - and how to use UPSERT with partial indexes:
Asides
No use for mixed case identifiers without double quotes in PostgreSQL.
You might consider a serial
column as primary key or an IDENTITY
column in Postgres 10 or later. Related:
So:
CREATE TABLE my_table (
my_table_id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY -- for pg 10+
-- my_table_id bigserial PRIMARY KEY -- for pg 9.6 or older
, id_a int8 NOT NULL
, id_b int8 NOT NULL
, id_c int8
, CONSTRAINT u_constraint UNIQUE (id_a, id_b, id_c)
);
If you don't expect more than 2 billion rows (> 2147483647) over the lifetime of your table (including waste and deleted rows), consider integer
(4 bytes) instead of bigint
(8 bytes).
Your first attempt makes by far the most sense - (the one with 4 fields).
As a rule, tables should be "tall and slim" rather than "short and fat"
It means that you can easily compare different alleles (for the same locus) belonging to the same individual - which is MUCH more difficult with your second schema.
The second schema means that you will have to use windowing/analytic
functions far more frequently - and if you're using MySQL (which is very
common in biology unfortunately) you won't have that facility.
Speaking as someone with both genetic and computer science degrees, I'd go with the first schema - I've worked with similar data (FragileX) - comparing different alleles - it was important for us to distinguish the father and the mother (we were doing multi-generational pedigrees).
I would recommend you include (arbitrary) - allele1 as the father's and allele2 as the mothers as part of your design - you never know when this knowledge might come in useful. Maybe you could add another field (isParentKnown) so that you can distinguish between those samples for which the parents are known from the ones for which they are not.
Best Answer
Yes, but it's not that hard:
create the dictionary
This creates the table with the following content:
Now create a new table based on the dictionary:
As your goal is to reduce the space overhead it's better to create new table with the dictionary id rather then
alter
ing the existing one. This will also be a lot faster then updating all rows from the existing table (with a billion rows this will however still take some time)