I have the following table:
# \d service
Table "public.service"
Column | Type | Modifiers
-------------+----------+-----------
customer_id | integer | not null
date | date | not null
service | smallint | not null
has | boolean |
Indexes:
"service_customer_id_idx" btree (customer_id)
# select count(*) from service;
count
-----------
327535416
(1 row)
Time: 75047.508 ms
# select version();
version
-------------------------------------------------------------------------------------------------------------
PostgreSQL 9.5.4 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 6.1.1 20160621 (Red Hat 6.1.1-3), 64-bit
(1 row)
I tried to come up with a query for which it is obviously beneficial to use the
index, since the results can be taken directly from the index in the correct
order:
# explain (analyze,verbose) select customer_id from service order by customer_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=64804990.92..65623829.48 rows=327535424 width=4) (actual time=487209.250..546557.064 rows=327535416 loops=1)
Output: customer_id
Sort Key: service.customer_id
Sort Method: external merge Disk: 4482448kB
-> Seq Scan on public.service (cost=0.00..5045816.24 rows=327535424 width=4) (actual time=2.705..49659.173 rows=327535416 loops=1)
Output: customer_id
Planning time: 0.093 ms
Execution time: 554914.731 ms
(8 rows)
Time: 554919.649 ms
As you see, Postgres prefers to do a sequential scan and then sort the results.
Interestingly, if I add a limit
clause, it does decide to use the index:
# explain (analyze,verbose) select customer_id from service order by customer_id limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..32.97 rows=10 width=4) (actual time=2.509..2.773 rows=10 loops=1)
Output: customer_id
-> Index Only Scan using service_customer_id_idx on public.service (cost=0.57..1061141647.19 rows=327535424 width=4) (actual time=2.503..2.760 rows=10 loops=1)
Output: customer_id
Heap Fetches: 10
Planning time: 4.285 ms
Execution time: 2.906 ms
(7 rows)
Time: 28.178 ms
Why does Postgres behave this way, and how could I debug this? Is there a way to
ask it to show alternative plans and their cost calculations?
Here's the execution with seqscan turned off:
# explain (analyze,verbose) select customer_id from service order by customer_id;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using service_customer_id_idx on public.service (cost=0.57..1061141647.19 rows=327535424 width=4) (actual time=2.753..346400.896 rows=327535416 loops=1)
Output: customer_id
Heap Fetches: 327535416
Planning time: 1.921 ms
Execution time: 355637.985 ms
(5 rows)
Time: 355647.367 ms
Best Answer
Generally speaking, reading a tuple from an index is more expensive then reading directly from the relation. This is due to the overhead of traversing the tree as well as occasionally reading from the table (due to changes for example)
Thus, if you scan the entire table Postgres will come to the conclusion that it should scan the table. Limiting the scan immediately changes the above as the number of tuples that need scanning is much more limited.