PostgreSQL values to unique dictionary

compression database-design database-size postgresql

I have a table with one billion rows and more than 50 columns. I need to reduce its size and speed up queries, backups, exports, etc. Some columns contain, for example, only a few hundred distinct values, which are long URLs (text data type), application names, and similar duplicated information.

Is there a PostgreSQL tool or script (for PostgreSQL 9.3+) that can, for selected columns, create dictionaries of distinct values in separate tables and then replace the original values with a smallint identifier referencing that dictionary? Or do I have to write the SQL for that manually?

TableOriginal
1;VeryLongURLText
2;VeryLongURLText
3;LoooongURLText
4;LoooongURLText
5;LoooongURLText

TableDictionary
1;VeryLongURLText
2;LoooongURLText

TableUpdated
1;1
2;1
3;2
4;2
5;2

Thank you.

Best Answer

Do I have to write SQL for that manually?

Yes, but it's not that hard:

create table original (id integer, url text);
insert into original 
values
(1,'VeryLongURLText'),
(2,'VeryLongURLText'),
(3,'LoooongURLText'),
(4,'LoooongURLText'),
(5,'LoooongURLText');

Create the dictionary:

create table dictionary (id serial, url text);
insert into dictionary (url)
select distinct url
from original;
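Since the question asks for smallint identifiers, a variant worth considering (assuming the column has fewer than 32,768 distinct values) is to declare the id as smallserial and enforce uniqueness on the dictionary entries:

```sql
-- Sketch: smallserial keeps the id 2 bytes wide; the unique constraint
-- guarantees each URL appears only once in the dictionary.
create table dictionary (id smallserial primary key, url text unique);

insert into dictionary (url)
select distinct url
from original;
```

Note that `select distinct` does not guarantee which id each value receives; if the mapping needs to be deterministic, add an `order by url` to the insert.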

This creates the table with the following content:

 id | url
---+----------------
 1 | LoooongURLText 
 2 | VeryLongURLText

Now create a new table based on the dictionary:

create table compressed 
as
select o.id, o.some_column, o.other_column, d.id as dictionary_id
from original o
  join dictionary d on o.url = d.url;
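To verify that the rewrite actually saves space, you can compare the on-disk sizes with PostgreSQL's built-in size functions (available in 9.3+); `some_column` and `other_column` stand in for whatever columns the real table has:

```sql
-- Compare the original table against the compressed table plus its dictionary.
select pg_size_pretty(pg_total_relation_size('original'))   as original_size,
       pg_size_pretty(pg_total_relation_size('compressed')
                    + pg_total_relation_size('dictionary')) as compressed_size;
```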

As your goal is to reduce the space overhead, it's better to create a new table with the dictionary id rather than altering the existing one. This will also be a lot faster than updating all rows in the existing table (although with a billion rows it will still take some time).
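If existing queries still need the original URL text, a view can join the dictionary back in transparently, so readers never see the surrogate ids. This is a sketch assuming the `compressed` table from above; a foreign key documents the relationship and protects against orphaned ids:

```sql
-- Optional integrity: every dictionary_id must exist in the dictionary.
alter table compressed
  add constraint fk_compressed_dictionary
  foreign key (dictionary_id) references dictionary (id);

-- Read-side view that presents the data in its original shape.
create view original_expanded as
select c.id, c.some_column, c.other_column, d.url
from compressed c
  join dictionary d on d.id = c.dictionary_id;
```

The foreign key requires the dictionary's `id` column to have a primary key or unique constraint. Once everything is verified, the old table can be dropped and `compressed` renamed into its place.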