Postgresql – Simplified table using more disk space than its original (bigger) version

arraydatabase-designdisk-spacepostgresqlstorage

Using PostgreSQL 9.3 on Debian 7.

I have a system that searches for text tokens inside PDF files and store where it was found for indexation.

Basically the table structure is:

                                               Table "capture.process_publication"
+----------------------------+-----------------------------+--------------------------------------------------------------------------------------+
|           Column           |            Type             |                                      Modifiers                                       |
+----------------------------+-----------------------------+--------------------------------------------------------------------------------------+
| id_process_publication     | bigint                      | not null default nextval('process_publication_id_process_publication_seq'::regclass) |
| search_token               | character varying(25)       | not null                                                                             |
| publication_date           | date                        | not null                                                                             |
| id_publication_site        | smallint                    | not null                                                                             |
| id_publication_book        | smallint                    | not null                                                                             |
| search_token_page          | integer                     | not null                                                                             |
| search_token_page_position | integer                     | not null                                                                             |
| publication_snippet        | text                        | not null                                                                             |
| insert_time                | timestamp without time zone | not null default now()                                                               |
| integration_consume_time   | timestamp without time zone |                                                                                      |
| elastic_consume_time       | timestamp without time zone |                                                                                      |
+----------------------------+-----------------------------+--------------------------------------------------------------------------------------+
Indexes:
    "process_publication_pkey" PRIMARY KEY, btree (id_process_publication)
    "idx_pp_publication_date_desc_for_elastic_consume" btree (publication_date DESC) WHERE integration_consume_time IS NOT NULL AND elastic_consume_time IS NULL
    "idx_pp_publication_date_desc_integration_consume_time_null" btree (publication_date DESC) WHERE integration_consume_time IS NULL
    "idx_uniq_pp" btree (publication_date, id_publication_book, search_token_page, search_token_page_position, search_token)
Foreign-key constraints:
    "fk_pp_pubb_id_publication_book" FOREIGN KEY (id_publication_book) REFERENCES publication_book(id_publication_book)
    "fk_pp_pubs_id_publication_site" FOREIGN KEY (id_publication_site) REFERENCES publication_site(id_publication_site)
Has OIDs: no

The problem was that certain tokens used to appear multiple times on the same page and it was generating a lot of redundant information, so I've changed the table structure to store only one record per token per page storing the positions and snippets as arrays as follows:

                                                Table "capture.process_publication_cp"
+-----------------------------+-----------------------------+--------------------------------------------------------------------------------------------+
|           Column            |            Type             |                                         Modifiers                                          |
+-----------------------------+-----------------------------+--------------------------------------------------------------------------------------------+
| id_process_publication_cp   | bigint                      | not null default nextval('process_publication_cp_id_process_publication_cp_seq'::regclass) |
| search_token                | character varying(25)       |                                                                                            |
| publication_date            | date                        |                                                                                            |
| id_publication_site         | smallint                    |                                                                                            |
| id_publication_book         | smallint                    |                                                                                            |
| search_token_page           | integer                     |                                                                                            |
| search_token_page_positions | integer[]                   |                                                                                            |
| publication_snippets        | text[]                      |                                                                                            |
| insert_time                 | timestamp without time zone |                                                                                            |
| integration_consume_time    | timestamp without time zone |                                                                                            |
| elastic_consume_time        | timestamp without time zone |                                                                                            |
+-----------------------------+-----------------------------+--------------------------------------------------------------------------------------------+
Indexes:
    "process_publication_cp_pkey" PRIMARY KEY, btree (id_process_publication_cp)
    "idx_uniq_pp_cp" UNIQUE, btree (publication_date, id_publication_book, search_token_page, search_token_page_positions, search_token)
Has OIDs: no

The old table has 440 million records while its new version has 432 million, 8 million less. The weird thing is that the new version is using 1GB more disk space than the old one as follows:

capture.process_publication     188 GB    
capture.process_publication_cp  189 GB

My question is very simple: Why? How does PostgreSQL store and manipulates array columns? It doesn't make sense to me at all.

Best Answer

You may be overlooking the 24 bytes of overhead for array types. Details:

This overhead is added twice to every row, while you only save 8 of 440 million rows. Doesn't seem to pay.

I suggest a normalized 1:n table design.

Also make sure you get valid measurements. Compare apples to apples, query the size after VACUUM FULL or immediately after creation with FILLFACTOR 100 and no dead rows. And be sure to get the right measurement:


After your question update, I suggest something like this:

CREATE TABLE capture.token_publication_v2 (
  id_process_publication   bigserial PRIMARY KEY
, publication_date         date NOT NULL
, id_publication_site      smallint NOT NULL
, id_publication_book      smallint NOT NULL
, search_token_page        integer NOT NULL
, search_token             varchar(25) NOT NULL  -- move down to optimize alignment padding
, insert_time              timestamptz NOT NULL DEFAULT now()
, integration_consume_time timestamptz
, elastic_consume_time     timestamptz
);


CREATE TABLE publication_snippet (
   id_process_publication bigint NOT NULL
      REFERENCES capture.token_publication_v2 ON UPDATE CASCADE ON DELETE CASCADE
   publication_snippet_id bigserial PRIMARY KEY
 , publication_snippet    text NOT NULL
);

CREATE TABLE search_token_page_position (
   id_process_publication     bigint NOT NULL
      REFERENCES capture.token_publication_v2 ON UPDATE CASCADE ON DELETE CASCADE
   publication_snippet_id     bigserial PRIMARY KEY
 , search_token_page_position integer NOT NULL
);

Related: