PostgreSQL values to unique dictionary

compression, database-design, database-size, postgresql

I have a table with one billion rows and more than 50 columns. I need to reduce its size and speed up queries, backups, exports, etc. Some columns contain only a few hundred distinct values, for example long URLs (text data type), names of the applications used, and similar duplicated information.

Is there a tool or script for PostgreSQL 9.3+ that can easily create dictionaries of distinct values in separate tables for selected columns, and then replace the original values with a smallint identifier from that dictionary? Or do I have to write the SQL for that manually?

TableOriginal
1;VeryLongURLText
2;VeryLongURLText
3;LoooongURLText
4;LoooongURLText
5;LoooongURLText

TableDictionary
1;VeryLongURLText
2;LoooongURLText

TableUpdated
1;1
2;1
3;2
4;2
5;2

Thank you.

Best Answer

Do I have to write SQL for that manually?

Yes, but it's not that hard:

create table original (id integer, url text);
insert into original 
values
(1,'VeryLongURLText'),
(2,'VeryLongURLText'),
(3,'LoooongURLText'),
(4,'LoooongURLText'),
(5,'LoooongURLText');

Create the dictionary:

create table dictionary (id serial, url text);
insert into dictionary (url)
select distinct url
from original;

This creates the table with the following content (the order in which the ids are assigned is not guaranteed):

id | url            
---+----------------
 1 | LoooongURLText 
 2 | VeryLongURLText
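The question asks for a smallint key, so the dictionary could also be declared with smallserial (available since PostgreSQL 9.2) plus constraints that keep it free of duplicates. This is only a minimal sketch: the table name dictionary_small is hypothetical, and it assumes the column has fewer than 32767 distinct values.

```sql
-- Hypothetical variant of the dictionary table with a 2-byte key.
-- Assumes fewer than 32767 distinct non-null URLs.
create table dictionary_small (
  id  smallserial primary key,  -- smallint backed by a sequence
  url text not null unique      -- exactly one row per distinct value
);

insert into dictionary_small (url)
select distinct url
from original
where url is not null;          -- NULL urls get no dictionary entry
```

Because of alignment padding, a smallint column may save less than expected when it sits directly before a 4-byte or 8-byte column, so it can be worth grouping the small key columns together in the new table.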

Now create a new table based on the dictionary:

create table compressed 
as
select o.id, d.id as dictionary_id  -- list the remaining columns of original here
from original o
  join dictionary d on o.url = d.url;

As your goal is to reduce the space overhead, it's better to create a new table with the dictionary id rather than altering the existing one. This will also be a lot faster than updating all rows in the existing table (although with a billion rows it will still take some time).
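One thing to watch with the inner join: rows whose url is NULL never satisfy o.url = d.url and would silently disappear from the new table. The sketch below uses an outer join instead, then checks the row counts and compares the on-disk sizes before anything is dropped. The names compressed_all and original_view are hypothetical, and the column list is shortened to the two columns of the sample table.

```sql
-- Keep rows whose url is NULL by using an outer join.
create table compressed_all
as
select o.id, d.id as dictionary_id
from original o
  left join dictionary d on o.url = d.url;

-- The row counts should match before the original table is dropped.
select (select count(*) from original)       as original_rows,
       (select count(*) from compressed_all) as compressed_rows;

-- Compare the on-disk size, including indexes and TOAST data.
select pg_size_pretty(pg_total_relation_size('original'))       as original_size,
       pg_size_pretty(pg_total_relation_size('compressed_all')) as compressed_size;

-- A view can present the original shape to existing queries.
create view original_view as
select c.id, d.url
from compressed_all c
  left join dictionary d on d.id = c.dictionary_id;
```

With a view like this, existing read queries can keep selecting the text value while the table itself stores only the small key.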

Asides

Mixed-case identifiers are of no use without double quotes in PostgreSQL: unquoted names are folded to lower case.

You might consider a serial column as primary key or an IDENTITY column in Postgres 10 or later. Related:

Auto increment table column

So:

CREATE TABLE my_table (
   my_table_id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY  -- for pg 10+
-- my_table_id bigserial PRIMARY KEY  -- for pg 9.6 or older
 , id_a int8 NOT NULL
 , id_b int8 NOT NULL
 , id_c int8
 , CONSTRAINT u_constraint UNIQUE (id_a, id_b, id_c)
);

If you don't expect more than 2 billion rows (> 2147483647) over the lifetime of your table (including waste and deleted rows), consider integer (4 bytes) instead of bigint (8 bytes).
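The per-value storage cost of the integer types can be checked directly with pg_column_size; a quick illustration:

```sql
-- Storage size in bytes of one value of each integer type.
select pg_column_size(1::smallint) as smallint_bytes,  -- 2
       pg_column_size(1::integer)  as integer_bytes,   -- 4
       pg_column_size(1::bigint)   as bigint_bytes;    -- 8
```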

Table design: paired values of genetic data

Your first attempt (the one with 4 fields) makes by far the most sense.

As a rule, tables should be "tall and slim" rather than "short and fat".

It means that you can easily compare different alleles (for the same locus) belonging to the same individual, which is MUCH more difficult with your second schema.

The second schema means that you will have to use windowing/analytic functions far more frequently - and if you're using MySQL (which is very common in biology unfortunately) you won't have that facility.

Speaking as someone with both genetic and computer science degrees, I'd go with the first schema. I've worked with similar data (FragileX), comparing different alleles, and it was important for us to distinguish the father and the mother (we were doing multi-generational pedigrees).

I would recommend that you (arbitrarily) treat allele1 as the father's and allele2 as the mother's as part of your design; you never know when this knowledge might come in useful. Maybe you could add another field (isParentKnown) so that you can distinguish the samples for which the parents are known from the ones for which they are not.
