Postgresql – Simplified table using more disk space than its original (bigger) version

arraydatabase-designdisk-spacepostgresqlstorage

Using PostgreSQL 9.3 on Debian 7.

I have a system that searches for text tokens inside PDF files and store where it was found for indexation.

Basically the table structure is:

                                               Table "capture.process_publication"
+----------------------------+-----------------------------+--------------------------------------------------------------------------------------+
|           Column           |            Type             |                                      Modifiers                                       |
+----------------------------+-----------------------------+--------------------------------------------------------------------------------------+
| id_process_publication     | bigint                      | not null default nextval('process_publication_id_process_publication_seq'::regclass) |
| search_token               | character varying(25)       | not null                                                                             |
| publication_date           | date                        | not null                                                                             |
| id_publication_site        | smallint                    | not null                                                                             |
| id_publication_book        | smallint                    | not null                                                                             |
| search_token_page          | integer                     | not null                                                                             |
| search_token_page_position | integer                     | not null                                                                             |
| publication_snippet        | text                        | not null                                                                             |
| insert_time                | timestamp without time zone | not null default now()                                                               |
| integration_consume_time   | timestamp without time zone |                                                                                      |
| elastic_consume_time       | timestamp without time zone |                                                                                      |
+----------------------------+-----------------------------+--------------------------------------------------------------------------------------+
Indexes:
    "process_publication_pkey" PRIMARY KEY, btree (id_process_publication)
    "idx_pp_publication_date_desc_for_elastic_consume" btree (publication_date DESC) WHERE integration_consume_time IS NOT NULL AND elastic_consume_time IS NULL
    "idx_pp_publication_date_desc_integration_consume_time_null" btree (publication_date DESC) WHERE integration_consume_time IS NULL
    "idx_uniq_pp" btree (publication_date, id_publication_book, search_token_page, search_token_page_position, search_token)
Foreign-key constraints:
    "fk_pp_pubb_id_publication_book" FOREIGN KEY (id_publication_book) REFERENCES publication_book(id_publication_book)
    "fk_pp_pubs_id_publication_site" FOREIGN KEY (id_publication_site) REFERENCES publication_site(id_publication_site)
Has OIDs: no

The problem was that certain tokens used to appear multiple times on the same page and it was generating a lot of redundant information, so I've changed the table structure to store only one record per token per page storing the positions and snippets as arrays as follows:

                                                Table "capture.process_publication_cp"
+-----------------------------+-----------------------------+--------------------------------------------------------------------------------------------+
|           Column            |            Type             |                                         Modifiers                                          |
+-----------------------------+-----------------------------+--------------------------------------------------------------------------------------------+
| id_process_publication_cp   | bigint                      | not null default nextval('process_publication_cp_id_process_publication_cp_seq'::regclass) |
| search_token                | character varying(25)       |                                                                                            |
| publication_date            | date                        |                                                                                            |
| id_publication_site         | smallint                    |                                                                                            |
| id_publication_book         | smallint                    |                                                                                            |
| search_token_page           | integer                     |                                                                                            |
| search_token_page_positions | integer[]                   |                                                                                            |
| publication_snippets        | text[]                      |                                                                                            |
| insert_time                 | timestamp without time zone |                                                                                            |
| integration_consume_time    | timestamp without time zone |                                                                                            |
| elastic_consume_time        | timestamp without time zone |                                                                                            |
+-----------------------------+-----------------------------+--------------------------------------------------------------------------------------------+
Indexes:
    "process_publication_cp_pkey" PRIMARY KEY, btree (id_process_publication_cp)
    "idx_uniq_pp_cp" UNIQUE, btree (publication_date, id_publication_book, search_token_page, search_token_page_positions, search_token)
Has OIDs: no

The old table has 440 million records while its new version has 432 million, 8 million less. The weird thing is that the new version is using 1GB more disk space than the old one as follows:

capture.process_publication     188 GB    
capture.process_publication_cp  189 GB

My question is very simple: Why? How does PostgreSQL store and manipulates array columns? It doesn't make sense to me at all.

Best Answer

You may be overlooking the 24 bytes of overhead for array types. Details:

Calculating and saving space in PostgreSQL

This overhead is added twice to every row, while you only save 8 of 440 million rows. Doesn't seem to pay.

I suggest a normalized 1:n table design.

Also make sure you get valid measurements. Compare apples to apples, query the size after VACUUM FULL or immediately after creation with FILLFACTOR 100 and no dead rows. And be sure to get the right measurement:

Measure the size of a PostgreSQL table row

After your question update, I suggest something like this:

CREATE TABLE capture.token_publication_v2 (
  id_process_publication   bigserial PRIMARY KEY
, publication_date         date NOT NULL
, id_publication_site      smallint NOT NULL
, id_publication_book      smallint NOT NULL
, search_token_page        integer NOT NULL
, search_token             varchar(25) NOT NULL  -- move down to optimize alignment padding
, insert_time              timestamptz NOT NULL DEFAULT now()
, integration_consume_time timestamptz
, elastic_consume_time     timestamptz
);


CREATE TABLE publication_snippet (
   id_process_publication bigint NOT NULL
      REFERENCES capture.token_publication_v2 ON UPDATE CASCADE ON DELETE CASCADE
   publication_snippet_id bigserial PRIMARY KEY
 , publication_snippet    text NOT NULL
);

CREATE TABLE search_token_page_position (
   id_process_publication     bigint NOT NULL
      REFERENCES capture.token_publication_v2 ON UPDATE CASCADE ON DELETE CASCADE
   publication_snippet_id     bigserial PRIMARY KEY
 , search_token_page_position integer NOT NULL
);

Configuring PostgreSQL for read performance

pgAdmin timing

When you execute a query from the query tool, the message pane shows something like:

Total query runtime: 62 ms.

And the status line shows the same time. I quote pgAdmin help about that:

The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.

If you want to see the time on the server you need to use SQL EXPLAIN ANALYZE or the built in Shift + F7keyboard shortcut or Query -> Explain analyze. Then, at the bottom of the explain output you get something like this:

Total runtime: 0.269 ms

Postgresql – Simple PostgreSQL lookup table is inexplicably slow

Rather than "VARCHAR(40000)" why not use "TEXT"?
HASH index use is discouraged (see docs at http://www.postgresql.org/docs/current/static/indexes-types.html).
Have you run "ANALYZE" on your tables before running the query?
Giant IN lists can be performance killers.

How do the following queries perform?

SELECT e.entityId
FROM entities e
INNER JOIN triples t ON (t.object = e.entityId)
LIMIT 10000;

SELECT e.entityId
FROM entities e
WHERE EXISTS (SELECT 1 FROM triples t WHERE t.object = e.entityId LIMIT 10000);

Best Answer

Related Solutions

Postgresql – Postgres multiple joins slow query, how to store default child record

pgAdmin timing

Postgresql – Simple PostgreSQL lookup table is inexplicably slow

Related Question