Postgresql – Postgres not using expression index

Tags: index, performance, postgresql, postgresql-9.5, query-performance, regular-expression

I am using Postgres 9.5. It is possible this is related to the changes in 9.6 pointed out by PostgreSQL not using index during count(*), but I cannot be sure yet.

I have a table that has a character varying column called text. I am attempting to optimize queries which have to perform a regex on this column.

This often looks like

select * from table where text ~ 'some|key|words|we|want';

or the inverse

select * from table where text !~ 'some|key|words|we|want';

sometimes in joins

select category, count(*)
from other_t join table on (other_t.table_id = table.id)
where table.text ~ 'some|key|words|we|want';

select category, count(*)
from other_t join
    (select id
     from table
     where table.text ~ 'some|key|words|we|want') as ti on (other_t.table_id = ti.id);

Notice that the regex is always the same.

I do not want to index the column directly because, AFAIK, indexing a text column like this produces a relatively huge index that would just be a waste of space. That also rules out a partial index, per the linked post above. So I have been trying to use an expression index, which should be possible here because the regex is always the same.

However, I can't get Postgres to use the index!

I've run VACUUM ANALYZE and tried every variation of the query I can think of, but the planner always chooses a Seq Scan.
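One way to check whether the index is usable at all, as opposed to merely unattractive to the planner, is to disable sequential scans for the session and re-run EXPLAIN. This is a diagnostic sketch only, not something to leave enabled:

```sql
-- Diagnostic only: steer the planner away from Seq Scan for this session.
set enable_seqscan = off;

explain (analyze, verbose)
select id from table where text ~ 'some|key|words|we|want';

-- If this now shows a Bitmap or Index Scan on the expression index, the index
-- does match the expression and the planner is simply preferring Seq Scan.
reset enable_seqscan;
```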

The column averages ~85 characters and typically only has one or two of the keywords we're looking for if it is a match.

Any ideas?

I am not interested in full-text keyword search. We have a finite, stable list of words we're looking for. I don't want to build an enormous trigram/tsvector index on this text column when all I really need is an expression index that Postgres actually decides to use.

We cannot drop the text column because it has other uses. We have considered adding a column that is essentially a cache of this query's result, but that's just ugly. If that is the best option, so be it, but we really should be able to use an index here.
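For reference, if the cached-column route does win out, it can be kept in sync with a trigger (a sketch with hypothetical names; PostgreSQL 9.5 predates generated columns, which would make this simpler on 12+):

```sql
-- Hypothetical column and trigger names; "table" stands for the real table name.
alter table table add column has_keywords boolean;

create or replace function table_set_has_keywords() returns trigger as $$
begin
  new.has_keywords := new.text ~ 'some|key|words|we|want';
  return new;
end;
$$ language plpgsql;

create trigger table_has_keywords_trg
  before insert or update of text on table
  for each row execute procedure table_set_has_keywords();

-- Backfill existing rows and index the flag.
update table set has_keywords = (text ~ 'some|key|words|we|want');
create index table_has_keywords_idx on table (has_keywords);
```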

Index Creation

create index text_col_idx on table ((text ~ 'some|key|words|we|want'));

Explain Analyze

explain (analyze, verbose) select id from table where text ~ 'some|key|words|we|want';

Results

Seq Scan on public.table  (cost=0.00..405624.24 rows=5190575 width=4) (actual time=0.113..52582.259 rows=5113014 loops=1)
  Output: id
  Filter: ((table.text)::text ~ 'some|key|words|we|want'::text)
  Rows Removed by Filter: 4728206
Planning time: 10.938 ms
Execution time: 52766.607 ms

Best Answer

Community wiki answer:

Your condition only eliminates about 50% of the rows in the table. In that case the Seq Scan will be faster than the Index Scan.

See for example this Stack Overflow answer, where user a_horse_with_no_name says:

If the SELECT returns more than approximately 5-10% of all rows in the table, a sequential scan is much faster than an index scan.

This is because an index scan requires several I/O operations for each row (look up the row in the index, then retrieve the row from the heap), whereas a sequential scan needs only a single I/O per row, or even fewer, because a block (page) on disk contains more than one row, so a single I/O operation can fetch several rows.

Also see Index-Only Scans in the documentation, which includes:

In principle, index-only scans can be used with expression indexes. [...] However, PostgreSQL's planner is currently not very smart about such cases. It considers a query to be potentially executable by index-only scan only when all columns needed by the query are available from the index.
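If the goal is an index-only scan for the `select id` query, one option is a multicolumn index that puts the expression first and adds `id`, so everything the query needs is available from the index. This is a sketch under the caveats from the docs above: it trades extra space for the *chance* of an index-only scan, and given ~50% selectivity the planner may still decline it:

```sql
-- Expression first, then the column the query actually outputs.
create index text_col_match_id_idx
  on table ((text ~ 'some|key|words|we|want'), id);

-- The query must repeat the indexed expression verbatim for it to match:
select id from table where (text ~ 'some|key|words|we|want');
```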

A separate idea would be to maintain a keywords table alongside your main one, with one row per keyword you are interested in, linked to the specific text rows that contain it, and populate it as needed. The choice will obviously depend on many other constraints that only you can know.
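A minimal sketch of that side-table idea (hypothetical names; "table" again stands for the real table name, and the link rows would be populated from application code or a trigger as rows arrive):

```sql
create table keywords (
  id   serial primary key,
  word text not null unique
);

create table table_keywords (
  table_id   integer not null references table (id),
  keyword_id integer not null references keywords (id),
  primary key (table_id, keyword_id)
);

-- Matching rows then become an indexed join instead of a regex scan:
select t.id
from table t
join table_keywords tk on tk.table_id = t.id
join keywords k on k.id = tk.keyword_id
where k.word in ('some', 'key', 'words', 'we', 'want');
```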