Using PostgreSQL 9.3 on Debian 7.
I have a system that searches for text tokens inside PDF files and stores where each token was found, for indexing.
Basically the table structure is:
Table "capture.process_publication"
+----------------------------+-----------------------------+--------------------------------------------------------------------------------------+
| Column | Type | Modifiers |
+----------------------------+-----------------------------+--------------------------------------------------------------------------------------+
| id_process_publication | bigint | not null default nextval('process_publication_id_process_publication_seq'::regclass) |
| search_token | character varying(25) | not null |
| publication_date | date | not null |
| id_publication_site | smallint | not null |
| id_publication_book | smallint | not null |
| search_token_page | integer | not null |
| search_token_page_position | integer | not null |
| publication_snippet | text | not null |
| insert_time | timestamp without time zone | not null default now() |
| integration_consume_time | timestamp without time zone | |
| elastic_consume_time | timestamp without time zone | |
+----------------------------+-----------------------------+--------------------------------------------------------------------------------------+
Indexes:
"process_publication_pkey" PRIMARY KEY, btree (id_process_publication)
"idx_pp_publication_date_desc_for_elastic_consume" btree (publication_date DESC) WHERE integration_consume_time IS NOT NULL AND elastic_consume_time IS NULL
"idx_pp_publication_date_desc_integration_consume_time_null" btree (publication_date DESC) WHERE integration_consume_time IS NULL
"idx_uniq_pp" btree (publication_date, id_publication_book, search_token_page, search_token_page_position, search_token)
Foreign-key constraints:
"fk_pp_pubb_id_publication_book" FOREIGN KEY (id_publication_book) REFERENCES publication_book(id_publication_book)
"fk_pp_pubs_id_publication_site" FOREIGN KEY (id_publication_site) REFERENCES publication_site(id_publication_site)
Has OIDs: no
The problem was that certain tokens appeared multiple times on the same page, which generated a lot of redundant information. So I changed the table structure to store only one record per token per page, keeping the positions and snippets as arrays:
Table "capture.process_publication_cp"
+-----------------------------+-----------------------------+--------------------------------------------------------------------------------------------+
| Column | Type | Modifiers |
+-----------------------------+-----------------------------+--------------------------------------------------------------------------------------------+
| id_process_publication_cp | bigint | not null default nextval('process_publication_cp_id_process_publication_cp_seq'::regclass) |
| search_token | character varying(25) | |
| publication_date | date | |
| id_publication_site | smallint | |
| id_publication_book | smallint | |
| search_token_page | integer | |
| search_token_page_positions | integer[] | |
| publication_snippets | text[] | |
| insert_time | timestamp without time zone | |
| integration_consume_time | timestamp without time zone | |
| elastic_consume_time | timestamp without time zone | |
+-----------------------------+-----------------------------+--------------------------------------------------------------------------------------------+
Indexes:
"process_publication_cp_pkey" PRIMARY KEY, btree (id_process_publication_cp)
"idx_uniq_pp_cp" UNIQUE, btree (publication_date, id_publication_book, search_token_page, search_token_page_positions, search_token)
Has OIDs: no
The old table has 440 million records, while its new version has 432 million, 8 million fewer. The weird thing is that the new version is using 1 GB more disk space than the old one:
capture.process_publication 188 GB
capture.process_publication_cp 189 GB
My question is very simple: why? How does PostgreSQL store and manipulate array columns? It doesn't make sense to me at all.
Best Answer
You may be overlooking the 24 bytes of overhead for array types.
This overhead is added twice to every row (once per array column), while you only save 8 million of 440 million rows. That doesn't seem to pay off.
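You can see the array overhead directly with `pg_column_size()`. A quick check (values as observed on a typical 64-bit PostgreSQL build):

```sql
SELECT pg_column_size(1::integer)              AS plain_int,    -- 4 bytes
       pg_column_size(ARRAY[1,2,3]::integer[]) AS int_array_3;  -- 36 bytes: 24-byte array header + 3 × 4 bytes of data
```

So an `integer[]` only starts saving space over separate `integer` columns/rows once the per-row cost it replaces exceeds that fixed header.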
I suggest a normalized 1:n table design.
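A minimal sketch of such a 1:n design, splitting the per-page row from its positions and snippets (all table and column names here are hypothetical, loosely following the question's schema):

```sql
-- One row per token per page.
CREATE TABLE capture.token_match (
    id_token_match      bigserial PRIMARY KEY,
    search_token        varchar(25) NOT NULL,
    publication_date    date        NOT NULL,
    id_publication_site smallint    NOT NULL,
    id_publication_book smallint    NOT NULL,
    search_token_page   integer     NOT NULL
);

-- One row per occurrence on that page.
CREATE TABLE capture.token_match_position (
    id_token_match      bigint  NOT NULL REFERENCES capture.token_match,
    page_position       integer NOT NULL,
    publication_snippet text    NOT NULL,
    PRIMARY KEY (id_token_match, page_position)
);
```

This keeps each occurrence individually indexable and avoids paying the array header per row, at the cost of a join when reading all occurrences back.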
Also make sure you get valid measurements. Compare apples to apples: query the size after VACUUM FULL, or immediately after creation with FILLFACTOR 100 and no dead rows. And be sure to take the right measurement.
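For an apples-to-apples comparison you may want to look at both the heap alone and the total including TOAST and indexes; something like this (a sketch using the standard size functions against the question's table names):

```sql
SELECT pg_size_pretty(pg_relation_size('capture.process_publication'))          AS old_heap,
       pg_size_pretty(pg_relation_size('capture.process_publication_cp'))       AS new_heap,
       pg_size_pretty(pg_total_relation_size('capture.process_publication'))    AS old_total,  -- heap + TOAST + indexes
       pg_size_pretty(pg_total_relation_size('capture.process_publication_cp')) AS new_total;
```

If the two tables carry different indexes (the old one has two partial indexes the new one lacks), `pg_total_relation_size()` and `pg_relation_size()` can tell very different stories.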