Postgresql – Why PostreSQL 9.2 does not use index scan if it can

indexpostgresql

My PosgreSQL planner (9.2) does not pick index and rather choose to do seq scan.

Here is my table. The index I am talking about is name_left_prefix

cwu=# \d web_city;
                                  Table "public.web_city"
   Column   |         Type          |                       Modifiers                       
------------+-----------------------+-------------------------------------------------------
 id         | integer               | not null default nextval('web_city_id_seq'::regclass)
 name       | character varying(64) | not null
 latitude   | double precision      | 
 longitude  | double precision      | 
 time_zone  | character varying(64) | not null
 population | bigint                | 
 country_id | integer               | 
 location   | geometry(Point,4326)  | 
Indexes:
    "web_city_pkey" PRIMARY KEY, btree (id)
    "name_left_prefix" btree ("left"(name::text, 5))
    "web_city_country_id" btree (country_id)
    "web_city_location_id" gist (location)
Foreign-key constraints:
    "country_id_refs_id_55d3b49fd200671" FOREIGN KEY (country_id) REFERENCES web_country(id) DEFERRABLE INITIALLY DEFERRED

Now I run the query that is supposed to use that index.

cwu=# explain analyze select count(*) from web_city GROUP BY (left(name,5));
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=19016.35..21188.28 rows=94677 width=10) (actual time=502.308..691.288 rows=64459 loops=1)
   ->  Sort  (cost=19016.35..19345.84 rows=131796 width=10) (actual time=502.300..660.121 rows=131796 loops=1)
         Sort Key: ("left"((name)::text, 5))
         Sort Method: external merge  Disk: 2048kB
         ->  Seq Scan on web_city  (cost=0.00..5554.45 rows=131796 width=10) (actual time=0.077..46.672 rows=131796 loops=1)
 Total runtime: 694.977 ms
(6 rows)

It does not use it though. But it does use it when I disable seq_scan and the resulting time is considerably better!

cwu=# set enable_seqscan=false;
SET
cwu=# explain analyze select count(*) from web_city GROUP BY (left(name,5));
                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..21237.14 rows=94677 width=10) (actual time=0.187..114.887 rows=64459 loops=1)
   ->  Index Scan using name_left_prefix on web_city  (cost=0.00..19394.69 rows=131796 width=10) (actual time=0.163..79.829 rows=131796 loops=1)
 Total runtime: 118.095 ms
(3 rows)

Why this is happening? I don't really desire to set enable_seqscan=false; from my application before making that query.

EDIT: I have just found out that this does not happen on my server that has PostgreSQL 9.2.3. But it does happen on my laptop with PostgreSQL 9.2.4. Maybe I'll try to downgrade just test it out.

EDIT2: After setting work_mem = 4MB to avoid Sort Method: external merge Disk: 2048kB(Thank you, Igor!) the time has got considerably better for the Seq Scan. Now I approve the decision not to use the index:).

cwu=# explain analyze select count(*) from web_city GROUP BY (left(name,5));
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=6213.43..6671.27 rows=36627 width=10) (actual time=95.879..111.447 rows=64459 loops=1)
   ->  Seq Scan on web_city  (cost=0.00..5554.45 rows=131796 width=10) (actual time=0.424..45.285 rows=131796 loops=1)
 Total runtime: 115.838 ms

Best Answer

PostgreSQL only supports index-only scans since 9.2 and only if the visibility map allows that.

If there is but a single concurrent transaction which has updated a record in a page (and hence there exists a record not visible to all transaction), the visibility map record for this page is invalidated and its records need to be checked for visibility (requiring a heap fetch).

A heap fetch is quite an expensive procedure and it's impossible to tell now many of them would be required on the optimization stage.

So optimizer just tries to play safe and chooses the more predictable path.