Postgresql – Postgres not using expression index

Tags: index, performance, postgresql, postgresql-9.5, query-performance, regular-expression

I am using Postgres 9.5. It is possible this is related to the changes in 9.6 pointed out by PostgreSQL not using index during count(*), but I cannot be sure yet.

I have a table that has a character varying column called text. I am attempting to optimize queries which have to perform a regex on this column.

This often looks like

select * from table where text ~ 'some|key|words|we|want';

or the inverse

select * from table where text !~ 'some|key|words|we|want';

sometimes in joins

select category, count(*)
from other_t join table on (other_t.table_id = table.id)
where table.text ~ 'some|key|words|we|want';

select category, count(*)
from other_t join
    (select id
     from table
     where table.text ~ 'some|key|words|we|want') as ti on (other_t.table_id = ti.id);

Notice that the regex is always the same.

I do not want to index the column directly because, AFAIK, indexing a text column like this produces a relatively huge index that would just be a waste of space. That also rules out a partial index, per the linked post above. So I have been trying to use an expression index, which should be possible here because the regex is always the same.

However, I can't get Postgres to use the index!

I've run VACUUM ANALYZE and tried every variation of the query I can think of, but the planner always chooses a Seq Scan.
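One way to check whether the index is usable at all, as opposed to merely unattractive to the planner, is to disable sequential scans for the session and re-run EXPLAIN. This is a diagnostic sketch only, not something to leave enabled:

```sql
-- Diagnostic only: steer the planner away from Seq Scan for this session.
set enable_seqscan = off;

explain (analyze, verbose)
select id from table where text ~ 'some|key|words|we|want';

-- If this now shows a Bitmap or Index Scan on the expression index, the index
-- does match the expression and the planner is simply preferring Seq Scan.
reset enable_seqscan;
```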

The column averages ~85 characters and typically only has one or two of the keywords we're looking for if it is a match.

Any ideas?

I am not interested in full-text keyword search. We have a finite, stable list of words we're looking for. I don't want to build an enormous trigram/tsvector index on this text column when all I really need is an expression index that Postgres actually decides to use.

We cannot drop the text column because it has other uses. We have considered adding a column that is essentially a cache of this query's result, but that's just ugly. If that is the best option, so be it, but we really should be able to use an index here.
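For reference, if the cached-column route does win out, it can be kept in sync with a trigger (a sketch with hypothetical names; PostgreSQL 9.5 predates generated columns, which would make this simpler on 12+):

```sql
-- Hypothetical column and trigger names; "table" stands for the real table name.
alter table table add column has_keywords boolean;

create or replace function table_set_has_keywords() returns trigger as $$
begin
  new.has_keywords := new.text ~ 'some|key|words|we|want';
  return new;
end;
$$ language plpgsql;

create trigger table_has_keywords_trg
  before insert or update of text on table
  for each row execute procedure table_set_has_keywords();

-- Backfill existing rows and index the flag.
update table set has_keywords = (text ~ 'some|key|words|we|want');
create index table_has_keywords_idx on table (has_keywords);
```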

Index Creation

create index text_col_idx on table ((text ~ 'some|key|words|we|want'));

Explain Analyze

explain (analyze, verbose) select id from table where text ~ 'some|key|words|we|want';

Results

Seq Scan on public.table  (cost=0.00..405624.24 rows=5190575 width=4) (actual time=0.113..52582.259 rows=5113014 loops=1)
  Output: id
  Filter: ((table.text)::text ~ 'some|key|words|we|want'::text)
  Rows Removed by Filter: 4728206
Planning time: 10.938 ms
Execution time: 52766.607 ms

Best Answer

Community wiki answer:

Your condition only eliminates about 50% of the rows in the table. In that case the Seq Scan will be faster than the Index Scan.

See for example this Stack Overflow answer, where user a_horse_with_no_name says:

If the SELECT returns more than approximately 5-10% of all rows in the table, a sequential scan is much faster than an index scan.

This is because an index scan requires several I/O operations for each row (look up the row in the index, then retrieve the row from the heap), whereas a sequential scan needs only a single I/O per row, or even fewer, because a block (page) on disk contains more than one row, so a single I/O operation can fetch several rows.

Also see Index-Only Scans in the documentation, which includes:

In principle, index-only scans can be used with expression indexes. [...] However, PostgreSQL's planner is currently not very smart about such cases. It considers a query to be potentially executable by index-only scan only when all columns needed by the query are available from the index.
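If the goal is an index-only scan for the `select id` query, one option is a multicolumn index that puts the expression first and adds `id`, so everything the query needs is available from the index. This is a sketch under the caveats from the docs above: it trades extra space for the *chance* of an index-only scan, and given ~50% selectivity the planner may still decline it:

```sql
-- Expression first, then the column the query actually outputs.
create index text_col_match_id_idx
  on table ((text ~ 'some|key|words|we|want'), id);

-- The query must repeat the indexed expression verbatim for it to match:
select id from table where (text ~ 'some|key|words|we|want');
```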

A separate idea would be to maintain a keywords table alongside your main one, with one row per keyword you are interested in, linked to the specific text rows that contain it, and populate it as needed. The choice will obviously depend on many other constraints that only you can know.
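A minimal sketch of that side-table idea (hypothetical names; "table" again stands for the real table name, and the link rows would be populated from application code or a trigger as rows arrive):

```sql
create table keywords (
  id   serial primary key,
  word text not null unique
);

create table table_keywords (
  table_id   integer not null references table (id),
  keyword_id integer not null references keywords (id),
  primary key (table_id, keyword_id)
);

-- Matching rows then become an indexed join instead of a regex scan:
select t.id
from table t
join table_keywords tk on tk.table_id = t.id
join keywords k on k.id = tk.keyword_id
where k.word in ('some', 'key', 'words', 'we', 'want');
```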