Postgresql – Big jump in search time for Postgres Index query for results with high selectivity


I am comparing the performance of databases and Lucene for full-text search.

So I use Postgres to create an index for the data to search:

CREATE INDEX bodies_index
  ON bodies
  USING gin
  (to_tsvector('english'::regconfig, body));

and to query it I use

SELECT * 
FROM bodies
WHERE to_tsvector('english', body) @@ plainto_tsquery('english', '" + searchterm + "')"
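
The quote-and-plus fragments around searchterm suggest the statement is assembled by string concatenation in application code. As a rough sketch only, the same search can be written with the term passed as a parameter, which sidesteps quoting problems in the search text. The statement name body_search and the sample term are made up for illustration and do not come from the question.

```sql
-- Prepare the search once per session; $1 stands for the search term.
PREPARE body_search(text) AS
SELECT *
FROM   bodies
WHERE  to_tsvector('english', body) @@ plainto_tsquery('english', $1);

-- Run it with a concrete term (placeholder value).
EXECUTE body_search('example search term');
```

The WHERE expression is kept identical to the indexed expression so that bodies_index remains usable.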

The results are good for almost all database sizes but there is one exception.
It takes up to 10x longer for the biggest database of 250 000 records,
but only for results with a high selectivity.

Any ideas how this could happen?

EDIT:
My PG version is 9.2.

Here is the result for EXPLAIN ANALYZE for a fast result:
(17k results in under 100ms)

http://explain.depesz.com/s/i1F

And this one is much slower, despite only a small rise in the number of results
(28k results in over 50 seconds)

http://explain.depesz.com/s/WwpX

Best Answer

As I understand your question, you are asking why highly selective index scans might become much slower once a certain number of records is returned or once the table reaches a certain size. As it turns out, your query plans provide most of the information needed, and understanding them is the first step toward solving the problem. It looks to me like your slow query is hitting a much more heavily used database than the fast query.

As I look at your query plans, I see two immediate pieces of bad news.

The first piece of bad news is in the buffer usage. The number of rows returned less than doubles, but you hit about six times as many buffer pages. This is bad news because it means that PostgreSQL, following the index scan, is scanning through at least six times more data to find the rows to retrieve. Significantly worse, the fast query on the rarely used database works mostly from read buffers (fast, but with few database-specific services), whereas the slow query shifts to shared buffers, which require more overhead.

The second piece of bad news is that the recheck condition in the slow query weeds out about three times as many records as are returned. This tells me you need to vacuum and/or reindex this table.
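
To reproduce the figures discussed above for a given search, EXPLAIN can be run with the ANALYZE and BUFFERS options, which exist in 9.2. This is only a sketch: the search term is a placeholder, not one of the terms behind the linked plans.

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   bodies
WHERE  to_tsvector('english', body) @@ plainto_tsquery('english', 'placeholder term');

-- Lines worth comparing between the fast and the slow plan:
--   Buffers: shared hit=... read=...    (pages found in shared_buffers vs. fetched from outside it)
--   Rows Removed by Index Recheck: ...  (rows discarded by the recheck condition; shown only when nonzero)
```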

I would recommend running VACUUM ANALYZE and REINDEX first, then, if necessary, slightly lowering the shared_buffers setting to see whether that improves performance.
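
A minimal sketch of that maintenance, using the table and index names from the question:

```sql
-- Reclaim dead rows and refresh planner statistics for the table.
VACUUM ANALYZE bodies;

-- Rebuild the GIN index from scratch. This blocks writes to the table,
-- and queries that use the index, while it runs.
REINDEX INDEX bodies_index;

-- Inspect the current value; changing it means editing postgresql.conf
-- and restarting the server.
SHOW shared_buffers;
```

Re-running the slow search afterwards shows whether the recheck and buffer numbers have dropped.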

Too bad

If you cannot change the query at all, that's too bad. You won't get a good solution. If the query did not schema-qualify the table (frames_stat instead of run.frames_stat), you could create a materialized view (see below) with the same name in another schema (or just a temporary one) and adapt the search_path (optionally only in sessions where this is desirable) for hugely superior performance.

Here's a recipe for such a technique:

How can I fake inet_client_addr() for unit tests in PostgreSQL?
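
A rough sketch of the search_path idea, under several assumptions that are not confirmed by the text: the fixed query reads from an unqualified frames_stat and only aggregates the maximum frame and time per run_id, the server is 9.3 or later (materialized views do not exist before that), and the schema name fast is invented for illustration.

```sql
-- Assumed shape of the query that cannot be changed:
--   SELECT run_id, max(frame), max("time") FROM frames_stat GROUP BY run_id;

CREATE SCHEMA fast;  -- hypothetical schema name

-- Same relation name and column names as the real table,
-- but pre-aggregated to one row per run_id.
CREATE MATERIALIZED VIEW fast.frames_stat AS
SELECT run_id, max(frame) AS frame, max("time") AS "time"
FROM   run.frames_stat
GROUP  BY run_id;

-- Only in sessions that should resolve frames_stat to the small relation first.
SET search_path = fast, run, public;
```

Any other query that expects the full table under the unqualified name would see the aggregated rows instead, so this only fits sessions dedicated to the one query.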

@Joishi's idea with a RULE would be a measure of (desperate) last resort. But I would rather not go there. Too many pitfalls with unexpected behavior.

Better query / indexes

If you could change the query, you should try to emulate a loose index scan:

Optimize GROUP BY query to retrieve latest record per user

This is even more efficient when based on a separate table with one row per relevant run_id; let's call it run_tbl. Create it if you don't have it yet!
Implemented with correlated subqueries:

SELECT run_id
    , (SELECT frame
       FROM   run.frames_stat
       WHERE  run_id = r.run_id
       ORDER  BY frame DESC NULLS LAST
       LIMIT  1) AS max_frame
    , (SELECT "time"
       FROM   run.frames_stat
       WHERE  run_id = r.run_id
       ORDER  BY "time" DESC NULLS LAST
       LIMIT  1) AS max_time
FROM   run_tbl r;
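
If run_tbl does not exist yet, one way to build it is sketched here. The primary key is an added assumption and only works if run_id is never NULL.

```sql
-- One-time build; this scans the big table once.
CREATE TABLE run_tbl AS
SELECT DISTINCT run_id
FROM   run.frames_stat;

-- Assumes run_id contains no NULL values.
ALTER TABLE run_tbl ADD PRIMARY KEY (run_id);
```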

Create two multicolumn indexes with matching sort order for lightning performance:

-- Index names must be unique within a schema, so the second index needs its own name.
CREATE INDEX fun_frame_idx ON run.frames_stat (run_id, frame DESC NULLS LAST);
CREATE INDEX fun_time_idx ON run.frames_stat (run_id, "time" DESC NULLS LAST);

NULLS LAST is only necessary if there can be null values. But it won't hurt either way.

Unused index in range of dates query

With only 280 distinct run_id, this will be very fast.

MATERIALIZED VIEW

Or, based on these key pieces of information:

The "frames_stat" table has 42 million rows

rows=280 -- number of returned rows = distinct run_id

The table is unchanging (no inserts/deletes)

Use a MATERIALIZED VIEW: it will be tiny (only 280 rows) and super fast.
You still need to change the query to base it on the MV instead of the table.
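
A minimal sketch of such a materialized view, assuming PostgreSQL 9.3 or later. The name frames_stat_max is made up for illustration.

```sql
CREATE MATERIALIZED VIEW run.frames_stat_max AS  -- hypothetical name
SELECT run_id
     , max(frame)  AS max_frame
     , max("time") AS max_time
FROM   run.frames_stat
GROUP  BY run_id;

-- The rewritten query reads the 280 precomputed rows.
SELECT run_id, max_frame, max_time
FROM   run.frames_stat_max;

-- Only needed if the underlying table changes after all.
REFRESH MATERIALIZED VIEW run.frames_stat_max;
```

max() ignores NULL values, so the result matches the ORDER BY ... DESC NULLS LAST LIMIT 1 subqueries shown earlier.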

Aside: never use reserved words like time (reserved in standard SQL) as identifiers.


