Postgresql – Big jump in search time for Postgres Index query for results with high selectivity

indexpostgresql

I am doing some performance comparison of databases and lucene for full-text searching.

So I use Postgres to create an Index for the data to search:

CREATE INDEX bodies_index
  ON bodies
  USING gin
  (to_tsvector('english'::regconfig, body));

and to query it I use

SELECT * 
FROM bodies
WHERE to_tsvector('english', body) @@ plainto_tsquery('english', '" + searchterm + "')"

The results are good for almost all database sizes but there is one exception.
It takes up to 10x longer for the biggest database of 250 000 records,
but only for results with a high selectivity.

Any ideas how this could happen?

EDIT:
My PG version is 9.2.

Here is the result for EXPLAIN ANALYZE for a fast result:
(17k results in under 100ms)

http://explain.depesz.com/s/i1F

And this one is much slower, despite only a small rise in the number of results
(28k results in over 50 seconds)

http://explain.depesz.com/s/WwpX

Best Answer

As I understand your question you are asking why highly selective index scans might be much slower after a certain number of records are returned or after the table reaches a certain size. As it turns out your query plans provide most of the information needed. Understanding of course is the first step in trying to figure out how to solve the problem. It looks to me like your slow query is hitting a much more heavily used db than the fast query.

As I look at your query plans, I see two immediate pieces of bad news.

The first piece of bad news is in the buffer usage. Your rows returned are less than double, but you hit about six times the number of buffer pages. This is bad news because it means that PostgreSQL, following the index scan, is scanning through at least six times more information to find the rows to retrieve. Significantly worse, you shift from the rarely used db of mostly read buffers (fast, but few db-specific services) to shared buffers which require more overhead.

The second piece of bad news is that the recheck condition in the slow query weeds out about three times as many records as are returned. This tells me you need to vacuum and/or reindex this table.

I would recommend vacuum analyze and reindex first, followed if necessary by slightly lowering the shared buffer settings and see if this improves performance.