Postgresql – Speeding up index scan backwards query

Tags: index, index-tuning, postgresql, postgresql-performance

My app is running the following PostgreSQL query, and it is extremely slow:

SELECT COUNT(*) 
FROM (
  SELECT 1 AS one 
  FROM "large_table" 
  WHERE "large_table"."user_id" = 123 
  ORDER BY "large_table"."id" desc 
  LIMIT 1 OFFSET 30
) subquery_for_count;

When I change the ORDER BY to ASC, it runs about 100x faster. I have the default primary key index on id, and I've experimented with adding an additional index on id in descending order, but it didn't seem to make a difference.

When I run EXPLAIN ANALYZE, I see that the slow query (DESC) uses a backward index scan. I tried manually disabling index scans for my session, and found that the query ran in 40 seconds instead of 2 minutes, which is a noticeable improvement.

Any idea what I can do to improve the speed of this query when sorting by DESC? I've read that b-tree indexes should generally give the same performance regardless of sort order, but that does not seem to be the case.

Best Answer

Your query must be using the index on "id" to read rows in the requested order, filtering out every row where "user_id" does not equal 123, and stopping once it has found 31 rows that survive the filter. Going in one direction, it quickly finds 31 such rows; going in the other direction, it has to filter out a large number of rows before 31 survive (because few or none of the rows at that end have user_id = 123).

You could readily confirm this theory by doing an EXPLAIN (ANALYZE, BUFFERS) of the queries.
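As a sketch of that check, the original query can be wrapped as shown here and run once with DESC and once with ASC. If the theory holds, the index scan node in the slow plan should report a large "Rows Removed by Filter" count and far more buffers touched than in the fast plan.

```sql
-- Run this as-is, then again with DESC changed to ASC, and compare
-- "Rows Removed by Filter" and the "Buffers:" lines in the two plans.
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM (
  SELECT 1 AS one
  FROM "large_table"
  WHERE "large_table"."user_id" = 123
  ORDER BY "large_table"."id" DESC
  LIMIT 1 OFFSET 30
) subquery_for_count;
```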

This is not fundamentally about the direction of the index scan. If you picked a user_id value other than 123 with the opposite property (its rows all occur at the logical end of the index rather than the logical beginning), the situation would be reversed: specifying DESC would fix the problem rather than cause it.
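One rough way to see where a given user's rows sit within the overall id range is a query along these lines. It is only a diagnostic sketch: the column aliases are made up for illustration, and without an index on user_id the query may itself need a full scan.

```sql
-- Compare the id range of one user's rows with the id range of the whole table.
-- If user_min_id/user_max_id cluster near table_min_id, a backward scan on id
-- has to skip most of the table before it reaches any matching row.
SELECT
    (SELECT min(id) FROM large_table) AS table_min_id,
    (SELECT max(id) FROM large_table) AS table_max_id,
    min(id)  AS user_min_id,
    max(id)  AS user_max_id,
    count(*) AS user_rows
FROM large_table
WHERE user_id = 123;
```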

Any idea on what I can do to try and improve the speed of this query when sorting by DESC?

Your query seems pointless. Counting is not an order-dependent activity. This is probably not your real query, so who knows whether our suggestions would carry over to it? The most straightforward fix for this query would be to build a multicolumn index on (user_id, id). Then rows would not be filtered out one by one; the index itself would exclude them wholesale.
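A minimal sketch of that index follows. The index name is hypothetical; what matters is the column order, with user_id first so the scan can jump straight to that user's entries, and id second so those entries are already sorted in either direction.

```sql
-- Hypothetical index name; columns are taken from the question.
CREATE INDEX large_table_user_id_id_idx
    ON large_table (user_id, id);
```

On a busy table, CREATE INDEX CONCURRENTLY is an option for building the index without blocking writes, at the cost of a slower build.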

Related Solutions

Postgresql – Postgres Index scan forward vs backward = speed difference of 357X slower

Since I like replacing aggregate functions by old-fashioned self-joins and NOT EXISTS clauses, here is my attempt:

CREATE SCHEMA IF NOT EXISTS tmp;
SET search_path='tmp';

DROP TABLE IF EXISTS tmp.changes CASCADE;
CREATE TABLE tmp.changes
        ( id integer NOT NULL PRIMARY KEY
        , fullname varchar
        , issuer varchar
        , rsymbol varchar
        , industry varchar
        , activity INTEGER NOT NULL
        , shareschange FLOAT
        , sharespchange FLOAT
        , mfiled FLOAT
        );

        -- lacking information from the OP
        -- I can only presume a flat distribution.
INSERT INTO tmp.changes(id, activity, shareschange,sharespchange,mfiled )
SELECT nm.*
        , (random() *20)::integer -- activity
        , random() *10000         -- shareschange
        , random() *100           -- sharespchange
        , random() *100000        -- mfiled
FROM generate_series(1,1000000) nm
        ;

ALTER TABLE tmp.changes
        ALTER shareschange
        SET STATISTICS 1000
        ;
ALTER TABLE tmp.changes
        ALTER mfiled
        SET STATISTICS 1000
        ;

VACUUM ANALYZE tmp.changes
        ;


CREATE INDEX changes_mfiled_shareschange
    ON tmp.changes(mfiled,shareschange)
        ;

EXPLAIN ANALYZE
SELECT initcap(ch.fullname) AS some_name1
     , initcap(ch.issuer) AS some_name2
     , upper(ch.rsymbol) AS some_name3
     , initcap(ch.industry) AS some_name4
     , ch.activity
     , to_char(ch.shareschange,'FM9,999,999,999,999,999') AS some_name5
     , ch.sharespchange || '%' AS some_name6
FROM   changes ch
WHERE  ch.activity IN (4,5)
        -- NOTE: the subquery is correlated only on mfiled;
        -- it is *not* restricted to activity IN (4,5).
        -- [I had expected a subselect of nx.activity IN (4,5)
        -- like in the main query. ]
AND    NOT EXISTS (SELECT * FROM changes nx
        WHERE nx.mfiled > ch.mfiled
        )
ORDER  BY ch.shareschange ASC
LIMIT  15
        ;

MySQL query not using an index when table contains many records

The table size is not the villain. It's the estimated number of rows.

The query optimizer, in this case (MyISAM, key starting with rtime, etc), will do something like this:

1. Estimate the percentage of the table to scan, based on "WHERE rtime BETWEEN ...".
2. If that is "small" (say, less than 20%, but that is not a hard number), use the INDEX; otherwise do a table scan.

Step 1 depends on the "statistics" that are kept with the MyISAM table. The stats are usually pretty accurate, but they can become less accurate. ANALYZE TABLE is the fix for that. (I don't think I have ever seen a need for ANALYZE being run more than monthly; usually it is not needed at all.)
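As a rough illustration, refreshing the statistics and then checking the optimizer's choice could look like the following. The table name `my_table` and the date range are placeholders invented for this sketch; only the `rtime` column comes from the discussion above.

```sql
-- Placeholder table name and dates; rtime is the indexed column discussed above.
ANALYZE TABLE my_table;

-- In the EXPLAIN output, the "key" column shows which index (if any) was chosen
-- and "rows" shows the optimizer's row estimate for the range.
EXPLAIN
SELECT *
FROM my_table
WHERE rtime BETWEEN '2024-01-01' AND '2024-01-31';
```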

The reason for the to-INDEX-or-not-to-INDEX question goes something like this... When using the INDEX, the execution has to bounce between index 'rows' and data rows. The data rows are potentially randomly scattered, leading to (potentially) lots of I/O. Hence, doing a 'table scan' is preferred after some point.
