PostgreSQL – Decrease Database Size with Expression Index

database-designdisk-spaceindexperformancepostgresqlpostgresql-performance

This is my current table definition in a Postgres 10.1-1 database:

CREATE TYPE CUSTOMER_TYPE AS ENUM
('enum1', 'enum2', 'enum3', '...', 'enum15');     -- max length of enum names ~15

CREATE TABLE CUSTOMER(
   CUSTOMER_ONE    TEXT PRIMARY KEY NOT NULL,     -- max 35 char String
   ATTRIBUTE_ONE   TEXT UNIQUE,                   -- max 35 char String
   ATTRIBUTE_TWO   TEXT,                          -- 1-80 char String
   PRIVATEKEYTYPE  CUSTOMER_TYPE                  -- see enum
);

It results in about 4.3x more database size compared to the size of the inserted data. (50 MB, 700.000 lines –> database size is 210 MB)

Attribute_One is computed as hash(Customer_One).

Requirements: fast searches (using algorithms) for columns CUSTOMER_ONE and ATTRIBUTE_ONE. (That's why I think I need an index.)

Typical search query:

select * from customer
where Customer_One='XXX' OR Attribute_One='XXX';

Each SELECT can find a maximum of 1 or 0 matching rows in millions of rows.

Is it possible to further decrease the DB size? I have been told to use an expression index but don't fully understand how this works. A short explanation with an example index or other solution would be great

Is the insert speed effected by those indexes? The faster the better. (To be clear: search speed is more important than insert speed.)

Best Answer

If hash() is an IMMUTABLE function (which should be the case for a function called "hash"!) you can omit storing the functionally dependent attribute_one in the table altogether and add an expression index to support queries on the expression hash(customer_one):

CREATE TABLE customer (
   privatekeytype customer_type     -- move the enum to 1st pos to save some more 
 , customer_one   text PRIMARY KEY
 , attribute_two  text
);

Expression index:

CREATE INDEX customer_attribute_one_idx ON customer (hash(customer_one));

This is exactly as big (identical) as the index supporting your original UNIQUE constraint on the redundant column attribute_one.

Query:

SELECT *
FROM   customer 
WHERE  'XXX' IN (customer_one, hash(customer_one));

Testing with EXPLAIN you'll see index or bitmap index scans like:

->  BitmapOr  (cost=5.34..5.34 rows=5 width=0)
     ->  Bitmap Index Scan on customer_pkey  (cost=0.00..2.66 rows=1 width=0)
           Index Cond: ('XXX'::text = customer.customer_one)
     ->  Bitmap Index Scan on customer_attribute_one_idx  (cost=0.00..2.68 rows=4 width=0)
           Index Cond: ('XXX'::text = hash(customer.customer_one))

About the same performance as with the redundant table column or faster since the table is smaller, yet - which helps overall performance in various ways.

Moving the enum to first position saves a few bytes of alignment padding per row as explained in my previous answer:

Why does the function have to be IMMUTABLE? See: