PostgreSQL – Optimize phraseto_tsquery Performance

Tags: index, postgresql

Some background:

I have a table, documents, with over 200,000 rows. It has a column, body, which can be very long. I put an index on this table:

CREATE INDEX documents_body_tsvector_index ON documents USING GIN (to_tsvector('english', body));

With this index, my queries that include WHERE clauses like to_tsvector('english', body) @@ to_tsquery('english', 'awesome') are blazing fast, which is great.

The problem:

While plainto_tsquery('english', 'awesome website') runs fast, phraseto_tsquery('english', 'awesome website') runs extremely slowly. (I've also tried to_tsquery('english', 'awesome <-> website'), and that is also extremely slow.)
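For reference, the two functions produce different tsquery operators for the same input, which is where the behavioral difference comes from:

```sql
SELECT plainto_tsquery('english', 'awesome website');
-- 'awesom' & 'websit'

SELECT phraseto_tsquery('english', 'awesome website');
-- 'awesom' <-> 'websit'
```

The & form only requires that both lexemes appear somewhere in the document; the <-> form additionally requires that they appear adjacent and in order.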

The question:

How can I speed this up? Am I doing something wrong with my index? I really need to take position into account—using phraseto_tsquery seems like the best way to do it. Any help would be much appreciated.

(Edit) The EXPLAIN:

Here's the fast query:

EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*)
FROM documents
WHERE to_tsvector('english', body) @@ plainto_tsquery('english', 'termone termtwo');

Aggregate  (cost=57.16..57.17 rows=1 width=8) (actual time=4.386..4.387 rows=1 loops=1)
  Buffers: shared hit=2949
  ->  Bitmap Heap Scan on documents  (cost=36.04..57.15 rows=5 width=0) (actual time=1.609..4.096 rows=3053 loops=1)
        Recheck Cond: (to_tsvector('english'::regconfig, body) @@ '''termone'' & ''termtwo'''::tsquery)
        Heap Blocks: exact=2907
        Buffers: shared hit=2949
        ->  Bitmap Index Scan on documents_body_tsvector_index  (cost=0.00..36.04 rows=5 width=0) (actual time=1.265..1.265 rows=3053 loops=1)
              Index Cond: (to_tsvector('english'::regconfig, body) @@ '''termone'' & ''termtwo'''::tsquery)
              Buffers: shared hit=42
Planning time: 0.094 ms
Execution time: 4.417 ms
(11 rows)

Here's the slow query:

EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*)
FROM documents
WHERE to_tsvector('english', body) @@ phraseto_tsquery('english', 'termone termtwo');

Aggregate  (cost=57.16..57.17 rows=1 width=8) (actual time=120252.866..120252.866 rows=1 loops=1)
  Buffers: shared hit=38792
  ->  Bitmap Heap Scan on documents  (cost=36.04..57.15 rows=5 width=0) (actual time=3.065..120250.231 rows=1750 loops=1)
        Recheck Cond: (to_tsvector('english'::regconfig, body) @@ '''termone'' <-> ''termtwo'''::tsquery)
        Rows Removed by Index Recheck: 1303
        Heap Blocks: exact=2907
        Buffers: shared hit=38792
        ->  Bitmap Index Scan on documents_body_tsvector_index  (cost=0.00..36.04 rows=5 width=0) (actual time=1.491..1.491 rows=3053 loops=1)
              Index Cond: (to_tsvector('english'::regconfig, body) @@ '''termone'' <-> ''termtwo'''::tsquery)
              Buffers: shared hit=42
Planning time: 0.540 ms
Execution time: 120252.938 ms
(12 rows)

Best Answer

The phrase search has to inspect every document that contains both 'awesom' & 'websit' to make sure the tokens occur in the correct order and proximity, because order and proximity data is not present in the index. Reparsing, tokenizing, stemming, etc., those documents is slow. Also, since a large document is stored out-of-line, the server has to read that data in from a separate TOAST table before it can do any of that. Without the word-order and proximity constraint, it doesn't have to do any recheck at all: in that case the index returns exact results, with no possible false positives (as long as work_mem is large enough).

Rather than using a functional index, you can store the tsvector in the table and keep that column populated (for example, with a trigger on insert and on update of body that sets tsv = to_tsvector('english', body)). Then build the index directly on the "tsv" column, and write the query directly against that column:

ALTER TABLE documents ADD COLUMN tsv tsvector;
UPDATE documents SET tsv = to_tsvector('english', body);
CREATE INDEX documents_tsv_tsvector_index ON documents USING GIN (tsv);

SELECT ... FROM documents WHERE tsv @@ phraseto_tsquery('english', 'awesome website');
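To keep "tsv" populated automatically, one option is the built-in tsvector_update_trigger function rather than a hand-written trigger body; a sketch (the trigger name here is illustrative):

```sql
-- Recompute tsv from body on every insert and update of the row.
-- tsvector_update_trigger(tsvector_column, config, text_columns...)
-- is shipped with PostgreSQL; the config must be schema-qualified.
CREATE TRIGGER documents_tsv_trigger
  BEFORE INSERT OR UPDATE ON documents
  FOR EACH ROW EXECUTE PROCEDURE
  tsvector_update_trigger(tsv, 'pg_catalog.english', body);
```

This keeps the column in sync without application changes; note it recomputes the tsvector on every update, even if body is unchanged.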

Here, it still needs to recheck "tsv" for order and proximity, which means reading that large column from disk back into memory, but it doesn't need to reparse the document, since the tsvector is already stored in a tokenized form designed for efficient processing. How much of an improvement this makes will depend on whether your bottleneck is the I/O for reading the documents back into memory or the CPU for doing the parsing.

Of course, this means you will be storing both "body" and "tsv" in the table, which will expand its size considerably, but that may be a small price to pay for the improved performance. If all you want to do is COUNT(*), you could even store only the tsvector and not store the raw "body" at all.