Postgresql – why index scan is slower then seq scan

postgresql

slava=# explain (analyze) SELECT * from pending_posts WHERE user_id <> 1456
slava-# ;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Seq Scan on pending_posts  (cost=0.00..17906.00 rows=999994 width=14) (actual time=0.021..231.803 rows=999994 loops=1)
   Filter: (user_id <> 1456)
   Rows Removed by Filter: 6
 Planning time: 0.108 ms
 Execution time: 289.845 ms
(5 rows)

slava=# explain (analyze) select * from pending_posts where user_id <> 1456 order_by user_id;
ERROR:  syntax error at or near "order_by"
LINE 1: ...select * from pending_posts where user_id <> 1456 order_by u...
                                                             ^
slava=# explain (analyze) select * from pending_posts where user_id <> 1456 order by user_id;
                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using p_i on pending_posts  (cost=0.42..50103.17 rows=999994 width=14) (actual time=0.033..899.520 rows=999994 loops=1)
   Filter: (user_id <> 1456)
   Rows Removed by Filter: 6
 Planning time: 0.141 ms
 Execution time: 953.982 ms
(5 rows)

Given that user_id is integer indexed with btree, should't it be faster to sort and then select using tree search?

Best Answer

PostgreSQL will evaluate all (or a high number) of possible query plans, and make a good guess at which one will be the least costly. If the different cost parameters (that depend, for instance, on the speed of your disks; the amount of available memory, etc.) are well set, most of the times that guess will be accurate and the choice of plans will be optimum.

You're asking PostgreSQL two different queries. One doesn't need sorting, the other one does. The rows returned are the same, because the rest of the query is the same. The query with the ORDER BY takes longer. That is to be expected.

PostgreSQL could have decided that performing a Seq Scan then a Sort would be quicker than performing an Index Scan. With the actual settings, it decided the opposite.

Performing an Index Scan takes longer than performing a Seq Scan just because, when scanning the index, PostgreSQL can just skip the entries with (user_id <> 1456) (the info for user_id is in the index). For the non-skipped values, it gets the (internal) reference to the corresponding row (which is also part of the index), and it has to retrieve the actual row in the table. It won't need to sort them, because by following the index, the values are already in the proper order.

With standard settings, the costs for for seq_page_cost and random_page_cost (in arbitrary units) are 4 and 1. That takes into account that, with HDD, jumping from one page to another needs a significant mechanical movement, and needs more time than reading pages one after next. You can assume a Seq Scan will read pages sequentially, and an Index Scan will mostly jump from one to another. [If you use SSD instead of HDD, the value for random_page_cost should probably be put to just 1.05].

If you would be using a covering index (one index that contains at least all the columns that are in your SELECT), then, PostgreSQL would be able to perform an Index Only Scan (meaning it doesn't have to retrieve the whole rows from the actual *table). That would speed up your query significantly. For a covering index to work, your table has to be properly maintained (i.e.: autovacuum must have had time to maintain it).

Try this query and see what the execution plan is:

EXPLAIN (analyze) 
SELECT user_id 
FROM pending_posts 
WHERE user_id <> 1456
ORDER BY user_id ;

If you don't get an Index Only Scan, perform a:

VACUUM ANALYZE pending_posts ;

and retry it.

Note that indexes can help you in several areas:

  1. In the SELECT clause if they are covering indexes. If the indexes contain all the columns specified, Index Only Scans are possible.
  2. In the WHERE clause if they allow to filter out (or go directly to) specific) rows.
  3. In the ORDER BY clause, if they provide already the rows in the sort order you specify.
  4. In the GROUP BY, if the aggregation is done by sorting the rows according to the groups (other techniques are possible).

References: