PostgreSQL – Index scan forward vs. backward: the ascending query is 357x slower

postgresql

PostgreSQL 9.1.2 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2
20080704 (Red Hat 4.1.2-51), 64-bit

Dedicated DB server

4GB ram
Shared_Buffers = 1 GB
Effective_cache_size = 3GB
Work_mem = 32MB

Analyze done

Queries ran multiple times, same differences/results

Default Statistics = 1000

Query (5366ms) :

explain analyze
select
    initcap (fullname)
  , initcap(issuer)
  , upper(rsymbol)
  , initcap(industry)
  , activity
  , to_char(shareschange,'FM9,999,999,999,999,999')
  , sharespchange || + E'\%'
from changes
where activity in (4,5) and mfiled >= (select max(mfiled) from changes)
order by shareschange asc
limit 15

Slow Ascending explain Analyze:

http://explain.depesz.com/s/zFz

Query (15ms) :

explain analyze
select
    initcap (fullname)
  , initcap(issuer)
  , upper(rsymbol)
  , initcap(industry)
  , activity
  , to_char(shareschange,'FM9,999,999,999,999,999')
  , sharespchange ||+ E'\%'
from changes
where activity in (4,5) and mfiled >= (select max(mfiled) from changes)
order by shareschange desc limit 15

Fast descending explain analyze:

http://explain.depesz.com/s/OP7

The index changes_shareschange is a btree index created with the default
ascending order. The index size is 32 MB.

The query plans and estimates are exactly the same, except that the descending
query uses an Index Scan Backward on changes_shareschange instead of a plain
Index Scan.

Yet the ascending version actually runs 357x slower than the descending one.

Why and how do I fix it?

Best Answer

Since I like replacing aggregate functions by old-fashioned self-joins and NOT EXISTS clauses, here is my attempt:

SET search_path='tmp';

DROP TABLE IF EXISTS tmp.changes CASCADE;
CREATE TABLE tmp.changes
        ( id integer NOT NULL PRIMARY KEY
        , fullname varchar
        , issuer varchar
        , rsymbol varchar
        , industry varchar
        , activity INTEGER NOT NULL
        , shareschange FLOAT
        , sharespchange FLOAT
        , mfiled FLOAT
        );

        -- lacking information from the OP
        -- I can only presume a flat distribution.
INSERT INTO tmp.changes(id, activity, shareschange,sharespchange,mfiled )
SELECT nm.*
        , (random() *20)::integer -- activity
        , random() *10000 -- shareschange
        , random() *100 -- sharespchange
        , random() *100000 -- mfiled
FROM generate_series(1,1000000) nm
        ;

ALTER TABLE tmp.changes
        ALTER shareschange
        SET STATISTICS 1000
        ;
ALTER TABLE tmp.changes
        ALTER mfiled
        SET STATISTICS 1000
        ;

VACUUM ANALYZE tmp.changes
        ;


CREATE INDEX changes_mfiled_shareschange
    ON tmp.changes(mfiled,shareschange)
        ;

EXPLAIN ANALYZE
SELECT initcap(ch.fullname) AS some_name1
     , initcap(ch.issuer) AS some_name2
     , upper(ch.rsymbol) AS some_name3
     , initcap(ch.industry) AS some_name4
     , ch.activity
     , to_char(ch.shareschange,'FM9,999,999,999,999,999') AS some_name5
     , ch.sharespchange || '%' AS some_name6
FROM   changes ch
WHERE  ch.activity IN (4,5)
        -- NOTE: as in the OP's query, the subquery looks at *all* rows:
        -- it is not restricted to activity IN (4,5).
        -- [I had expected nx.activity IN (4,5) in the subselect,
        -- like in the main query. ]
AND    NOT EXISTS (SELECT * FROM changes nx
        WHERE nx.mfiled > ch.mfiled
        )
ORDER  BY ch.shareschange ASC
LIMIT  15
        ;
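
To reproduce the comparison from the question on this test table, the same rewrite can be run with the sort direction flipped. This is only a sketch: it assumes the tmp.changes table and the random data generated above, the output column aliases are made up, and with uniformly distributed data the two directions may well show no difference at all.

```sql
-- Descending counterpart of the query above; only the ORDER BY direction differs.
-- BUFFERS adds block hit/read counts, which helps when comparing the two plans.
EXPLAIN (ANALYZE, BUFFERS)
SELECT ch.activity
     , to_char(ch.shareschange,'FM9,999,999,999,999,999') AS shareschange_fmt
     , ch.sharespchange || '%' AS sharespchange_pct
FROM   tmp.changes ch
WHERE  ch.activity IN (4,5)
AND    NOT EXISTS (SELECT * FROM tmp.changes nx
        WHERE nx.mfiled > ch.mfiled
        )
ORDER  BY ch.shareschange DESC
LIMIT  15
        ;
```

Comparing the actual row counts of the scan nodes in the two outputs shows how much of the index each direction had to walk before 15 matching rows were found.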

Related Solutions

Postgresql – Why does this limit make the postgres planner use a much slower index scan instead of a much faster bitmap heap/index scan

The big difference between the first two queries is that in the first one, the planner could walk the index used by the primary key of the table (which also serves the ORDER BY clause) and filter out the rows that don't match the WHERE condition. You can see that it had to visit about 621 rows (the 10 that got returned and 611 that were filtered out) before it was done.

Now the second one used the same logic, but not having found a single match (not to mention 10), it had to go through the whole index and throw away all rows (Rows Removed by Filter: 796146).
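
The effect described above can be tried out on a small test table. The following is a sketch under stated assumptions: the table demo_items and its columns are hypothetical, the data is constructed so that every matching row sits at the high end of the primary key, and the planner is free to choose a different plan than an ordered index scan.

```sql
-- Hypothetical table; names are illustrative and not taken from the question.
CREATE TABLE demo_items
        ( id integer NOT NULL PRIMARY KEY
        , flag boolean NOT NULL
        );

-- Only the 10 rows with the highest ids satisfy the filter.
INSERT INTO demo_items(id, flag)
SELECT g, g > 999990
FROM generate_series(1,1000000) g
        ;

VACUUM ANALYZE demo_items;

-- Ascending: if the primary key index is walked in order, almost the whole
-- index has to be read and filtered before the first match turns up.
EXPLAIN ANALYZE
SELECT * FROM demo_items WHERE flag ORDER BY id ASC LIMIT 10;

-- Descending: walking the same index backward reaches the matches at once.
EXPLAIN ANALYZE
SELECT * FROM demo_items WHERE flag ORDER BY id DESC LIMIT 10;
```

If both plans use the primary key index, the number of rows discarded by the filter is the figure to compare between the two outputs.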

The second pair, without ordering, chose a different plan, which in this case happened to be more effective for returning 0 rows :)

And the third pair, knowing it has to return lots of rows (it planned 3573 as opposed to 10), again went for a different plan, with a bitmap heap scan (not a bitmap index scan, as in the second pair). The time difference can be attributed mostly to this node:

Sort Method: external merge Disk: 12288kB

If you raised work_mem to a higher value (say 100 MB), this difference would mostly go away, I guess.
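
A quick way to test this guess without touching postgresql.conf is to raise work_mem for the current session only. This is a minimal sketch: the 100MB value comes from the suggestion above, and the query to re-run is whichever one produced the external merge sort.

```sql
-- Show the current setting first, so it is clear what is being changed.
SHOW work_mem;

-- Raise the limit for this session only; other connections are unaffected.
SET work_mem = '100MB';

-- Re-run EXPLAIN ANALYZE on the slow query here and check the "Sort Method" line.

-- Return to the configured default afterwards.
RESET work_mem;
```

Keep in mind that work_mem applies per sort or hash operation, so a large value set globally can add up across concurrent queries.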

Postgresql – Postgres is performing sequential scan instead of index scan

This is a known issue regarding Postgres optimization. If the distinct values are few - like in your case - and you are on version 8.4 or later, a very fast workaround using a recursive query is described here: Loose Indexscan.

Your query could be rewritten (LATERAL needs version 9.3 or later):

WITH RECURSIVE pa AS 
( ( SELECT labelDate FROM pages ORDER BY labelDate LIMIT 1 ) 
  UNION ALL
    SELECT n.labelDate 
    FROM pa AS p
         , LATERAL 
              ( SELECT labelDate 
                FROM pages 
                WHERE labelDate > p.labelDate 
                ORDER BY labelDate 
                LIMIT 1
              ) AS n
) 
SELECT labelDate 
FROM pa ;
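
On versions between 8.4 and 9.2, where LATERAL is not available, the same loose index scan can be written with a scalar subquery in the SELECT list. The sketch below follows the pattern from the Loose Indexscan wiki page, applied to the pages table and labelDate column used above; it assumes an index on pages (labelDate), which the source does not show.

```sql
WITH RECURSIVE pa AS
( ( SELECT labelDate FROM pages ORDER BY labelDate LIMIT 1 )
  UNION ALL
    -- Fetch the next larger value with one index probe per distinct value.
    SELECT ( SELECT labelDate
             FROM pages
             WHERE labelDate > p.labelDate
             ORDER BY labelDate
             LIMIT 1 )
    FROM pa AS p
    -- The scalar subquery yields NULL once no larger value exists,
    -- which ends the recursion.
    WHERE p.labelDate IS NOT NULL
)
SELECT labelDate
FROM pa
WHERE labelDate IS NOT NULL ;
```

The outer WHERE clause drops the single NULL row that the recursion emits as its terminator.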

Erwin Brandstetter has a thorough explanation and several variations of the query in this answer (on a related but different issue): Optimize GROUP BY query to retrieve latest record per user
