PostgreSQL – Decrease Database Size with Expression Index

database-design disk-space index performance postgresql postgresql-performance

This is my current table definition in a Postgres 10.1-1 database:

CREATE TYPE CUSTOMER_TYPE AS ENUM
('enum1', 'enum2', 'enum3', '...', 'enum15');     -- max length of enum names ~15

CREATE TABLE CUSTOMER(
   CUSTOMER_ONE    TEXT PRIMARY KEY NOT NULL,     -- max 35 char String
   ATTRIBUTE_ONE   TEXT UNIQUE,                   -- max 35 char String
   ATTRIBUTE_TWO   TEXT,                          -- 1-80 char String
   PRIVATEKEYTYPE  CUSTOMER_TYPE                  -- see enum
);

The database ends up about 4.3x the size of the inserted data (50 MB, 700,000 rows -> database size is 210 MB).

Attribute_One is computed as hash(Customer_One).

Requirements: fast searches (using algorithms) for columns CUSTOMER_ONE and ATTRIBUTE_ONE. (That's why I think I need an index.)

Typical search query:

select * from customer
where Customer_One='XXX' OR Attribute_One='XXX';

Each SELECT finds at most one matching row among millions of rows.

Is it possible to further decrease the DB size? I have been told to use an expression index but don't fully understand how this works. A short explanation with an example index or other solution would be great.

Is the insert speed affected by those indexes? The faster the better. (To be clear: search speed is more important than insert speed.)

Best Answer

If hash() is an IMMUTABLE function (which should be the case for a function called "hash"!) you can omit storing the functionally dependent attribute_one in the table altogether and add an expression index to support queries on the expression hash(customer_one):

CREATE TABLE customer (
   privatekeytype customer_type     -- move the enum to 1st pos to save some more space
 , customer_one   text PRIMARY KEY
 , attribute_two  text
);

Expression index:

CREATE INDEX customer_attribute_one_idx ON customer (hash(customer_one));

This index is exactly as big as the index that supported your original UNIQUE constraint on the redundant column attribute_one, since it stores the same values.
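
The original table also enforced uniqueness of attribute_one through its UNIQUE constraint. If that guarantee still matters (for example, to reject hash collisions), a unique expression index is one way to keep it. A minimal sketch, assuming hash() is IMMUTABLE as noted above; the index name is only illustrative:

```sql
-- Use instead of the plain index above, not in addition to it.
-- Rejects any row whose hash(customer_one) already exists in the table.
CREATE UNIQUE INDEX customer_attribute_one_uni_idx
    ON customer (hash(customer_one));
```

The uniqueness check adds a little work per insert, much like the original UNIQUE constraint did.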

Query:

SELECT *
FROM   customer 
WHERE  'XXX' IN (customer_one, hash(customer_one));

Testing with EXPLAIN you'll see index or bitmap index scans like:

->  BitmapOr  (cost=5.34..5.34 rows=5 width=0)
     ->  Bitmap Index Scan on customer_pkey  (cost=0.00..2.66 rows=1 width=0)
           Index Cond: ('XXX'::text = customer.customer_one)
     ->  Bitmap Index Scan on customer_attribute_one_idx  (cost=0.00..2.68 rows=4 width=0)
           Index Cond: ('XXX'::text = hash(customer.customer_one))

Performance is about the same as with the redundant table column, or faster, since the table is now smaller, which helps overall performance in various ways.
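
To check both claims on your own data, you could look at the plan and at the on-disk sizes. This is only a sketch, assuming the customer table and the hash() function from above exist in the current database; on a very small test table the planner may well choose a sequential scan instead:

```sql
-- Show the plan for the typical lookup; ideally both indexes appear.
EXPLAIN
SELECT *
FROM   customer
WHERE  'XXX' IN (customer_one, hash(customer_one));

-- Compare these numbers before and after dropping the redundant column.
SELECT pg_size_pretty(pg_relation_size('customer'))       AS table_size    -- heap only
     , pg_size_pretty(pg_indexes_size('customer'))        AS indexes_size  -- all indexes
     , pg_size_pretty(pg_total_relation_size('customer')) AS total_size;   -- heap + indexes + TOAST
```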

Moving the enum to first position saves a few bytes of alignment padding per row as explained in my previous answer:

Why is my database 12 times bigger than inserted data?

Why does the function have to be IMMUTABLE? See:
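
In short: an index stores the result of the expression at write time, so the result must never change for the same input. Postgres therefore only accepts functions marked IMMUTABLE in index expressions. A minimal sketch, assuming a stand-in for the question's hash() built on md5() (the real function is not shown in the question):

```sql
-- Hypothetical stand-in: the question does not say how hash() is computed.
CREATE FUNCTION hash(_input text)
  RETURNS text
  LANGUAGE sql
  IMMUTABLE          -- promise: same input always yields the same output
AS $$SELECT md5(_input)$$;

-- Works, because hash() is declared IMMUTABLE.
CREATE INDEX customer_attribute_one_idx ON customer (hash(customer_one));

-- With a function declared STABLE or VOLATILE instead, CREATE INDEX fails with:
-- ERROR:  functions in index expression must be marked IMMUTABLE
```

Declaring a function IMMUTABLE is a promise, not something Postgres verifies. If the function's result can actually change, the index silently returns wrong results.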

Questionable use case

...each CONTENT entry consists of one random word and a text string that is the same for all rows.

A text string that is the same for all rows is just dead freight. Remove it and concatenate it in a view if you need to show it.
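
A minimal sketch of such a view, assuming the constant part has been removed from the stored "CONTENT" column; the view name and the placeholder text are hypothetical:

```sql
-- Hypothetical names: file_display and the constant text are placeholders.
CREATE VIEW file_display AS
SELECT "ITEMID"
     , "CONTENT" || ' <constant text shared by all rows>' AS "CONTENT"
FROM   "File";
```

Searches would then run against the smaller base table, while the view is used only for display.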

Obviously, you are aware of that:

Granted, it is not realistic ... But since I can't control the text ...

Upgrade your Postgres version

Running PostgreSQL 9.3.4

While still on Postgres 9.3, you should at least upgrade to the latest point release (currently 9.3.9). The official recommendation of the project:

We always recommend that all users run the latest available minor release for whatever major version is in use.

Better yet, upgrade to 9.4, which has received major improvements for GIN indexes.

Major problem 1: Cost estimates

The cost of some text search functions has been seriously underestimated up to and including version 9.4. That cost is raised by a factor of 100 in the upcoming version 9.5, as @jjanes describes in his recent answer:

PostgreSQL not using index

Here are the respective thread where this was discussed and the commit message by Tom Lane.

As you can see in the commit message, to_tsvector() is among those functions. You can apply the change immediately (as superuser):

ALTER FUNCTION to_tsvector (regconfig, text) COST 100;

which should make it much more likely that your functional index is used.
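
To confirm that the change took effect, you could read the cost back from the system catalog. A small sketch using the standard pg_proc catalog; the column alias is arbitrary:

```sql
-- procost is the planner's estimated execution cost for the function.
SELECT oid::regprocedure AS func, procost
FROM   pg_proc
WHERE  oid = 'to_tsvector(regconfig, text)'::regprocedure;
```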

Major problem 2: KNN

The core problem is that Postgres has to calculate a rank with ts_rank() for 260k rows (rows=261011) before it can order by and pick the top 5. This is going to be expensive, even after you have fixed other problems as discussed. It's a K-nearest-neighbour (KNN) problem by nature and there are solutions for related cases. But I cannot think of a general solution for your case, since the rank calculation itself depends on user input. I would try to eliminate the bulk of low-ranking matches early so that the full calculation only has to be done for a few good candidates.

One way I can think of is to combine your fulltext search with trigram similarity search, which offers a working implementation for the KNN problem. This way you can pre-select the "best" matches with a LIKE predicate as candidates (in a subquery with LIMIT 50 for example) and then pick the 5 top-ranking rows according to your rank calculation in the main query.
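
A rough sketch of that two-step idea. It is an assumption-laden variant, not a tested solution: it pre-selects candidates by trigram distance (the pg_trgm <-> operator) rather than by LIKE, the index name is made up, and a GiST trigram index on long text values can be large and slow to build:

```sql
-- Assumes the pg_trgm extension is available on the server.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GiST trigram index: supports nearest-neighbour ordering with <->.
CREATE INDEX file_content_trgm_idx ON "File"
USING  gist ("CONTENT" gist_trgm_ops);

SELECT "ITEMID"
     , ts_rank(to_tsvector('english', "CONTENT")
             , plainto_tsquery('english', 'searchTerm')) AS rank
FROM  (
   SELECT "ITEMID", "CONTENT"
   FROM   "File"
   WHERE  to_tsvector('english', "CONTENT")
          @@ plainto_tsquery('english', 'searchTerm')
   ORDER  BY "CONTENT" <-> 'searchTerm'   -- smallest trigram distance first
   LIMIT  50                              -- candidates only
   ) candidates
ORDER  BY rank DESC                       -- expensive rank for 50 rows, not 260k
LIMIT  5;
```

The top 5 of the 50 candidates are not guaranteed to be the top 5 of the whole table; that is the trade-off for ranking far fewer rows.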

Or apply both predicates in the same query and pick the closest matches according to trigram similarity (which would produce different results) like in this related answer:

PostgreSQL FTS and Trigram-similarity Query Optimization

I did some more research and you are not the first to run into this problem. Related posts on pgsql-general:

Work is ongoing to eventually implement a tsvector <-> tsquery operator.

Oleg Bartunov and Alexander Korotkov even presented a working prototype (using >< as operator instead of <-> back then), but it is very complex to integrate into Postgres: the whole infrastructure for GIN indexes has to be reworked (most of which is done by now).

Major problem 3: weights and index

And I identified one more factor adding to the slowness of the query. Per documentation:

GIN indexes are not lossy for standard queries, but their performance depends logarithmically on the number of unique words. (However, GIN indexes store only the words (lexemes) of tsvector values, and not their weight labels. Thus a table row recheck is needed when using a query that involves weights.)

Bold emphasis mine. As soon as weights are involved, each row has to be fetched from the heap (not just a cheap visibility check) and long values have to be de-toasted, which adds to the cost. But there seems to be a solution for that:

Index definition

Looking at your index again, it doesn't seem to make sense to begin with. You assign a weight to a single column, which is meaningless as long as you don't concatenate other columns with a different weight.

COALESCE() also makes no sense as long as you don't actually concatenate more columns.
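
For contrast, this is roughly what an index looks like where weights and COALESCE() do serve a purpose. It is only an illustration: the "TITLE" column and the index name are hypothetical and do not exist in your table as shown:

```sql
-- Hypothetical second column "TITLE"; the question only shows "CONTENT".
-- Weights distinguish the two sources; COALESCE keeps a NULL in one column
-- from turning the whole concatenated tsvector into NULL.
CREATE INDEX file_title_content_idx ON "File" USING gin ((
      setweight(to_tsvector('english', COALESCE("TITLE", '')), 'A')
   || setweight(to_tsvector('english', COALESCE("CONTENT", '')), 'B')
));
```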

Simplify your index:

CREATE INDEX "File_contentIndex" ON "File" USING gin
(to_tsvector('english', "CONTENT"));

And your query:

SELECT "ITEMID", ts_rank(to_tsvector('english', "CONTENT")
                       , plainto_tsquery('english', 'searchTerm')) AS rank
FROM   "File"
WHERE  to_tsvector('english', "CONTENT")
       @@ plainto_tsquery('english', 'searchTerm')
ORDER  BY rank DESC
LIMIT  5;

Still expensive for a search term that matches every row, but probably much less so.

Asides

With all of these issues combined, the insane cost of 520 seconds for your second query is beginning to make sense. But there still may be more problems. Did you configure your server?
All the usual advice for performance optimization applies.

It makes your life easier if you don't work with double-quoted CaMeL-case identifiers:

Are PostgreSQL column names case-sensitive?
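
A short illustration of the difference, using a throwaway table whose names are made up for this example:

```sql
-- Hypothetical demo table; quoted identifiers keep their exact case.
CREATE TABLE "Demo" ("ItemId" int);

SELECT "ItemId" FROM "Demo";   -- works: names match exactly

SELECT ItemId FROM Demo;       -- fails: unquoted names are folded to lower case,
                               -- so Postgres looks for relation "demo"
```

With all-lowercase, unquoted names, none of the double quotes are needed.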

Related Solutions

Postgresql query slowed with table growth

Postgresql – Slow fulltext search for terms with high occurence