PostgreSQL is failing to use the covering index and falling back to a much slower bitmap scan

performance, postgresql, postgresql-9.6, postgresql-performance

I'm trying to figure out why a table of mine is using a Bitmap Heap Scan when an index scan is dramatically faster.

Table:

webarchive=# \d web_pages
                                               Table "public.web_pages"
      Column       |            Type             |                              Modifiers
-------------------+-----------------------------+---------------------------------------------------------------------
 id                | bigint                      | not null default nextval('web_pages_id_seq'::regclass)
 state             | dlstate_enum                | not null
 errno             | integer                     |
 url               | text                        | not null
 starturl          | text                        | not null
 netloc            | text                        | not null
 file              | bigint                      |
 priority          | integer                     | not null
 distance          | integer                     | not null
 is_text           | boolean                     |
 limit_netloc      | boolean                     |
 title             | citext                      |
 mimetype          | text                        |
 type              | itemtype_enum               |
 content           | text                        |
 fetchtime         | timestamp without time zone |
 addtime           | timestamp without time zone |
 normal_fetch_mode | boolean                     | default true
 ignoreuntiltime   | timestamp without time zone | not null default '1970-01-01 00:00:00'::timestamp without time zone
Indexes:
    "web_pages_pkey" PRIMARY KEY, btree (id)
    "ix_web_pages_url" UNIQUE, btree (url)
    "ix_web_pages_distance" btree (distance)
    "ix_web_pages_fetchtime" btree (fetchtime)
    "ix_web_pages_id" btree (id)
    "ix_web_pages_id_state" btree (id, state)
    "ix_web_pages_netloc" btree (netloc)
    "ix_web_pages_priority" btree (priority)
    "ix_web_pages_state" btree (state)
    "web_pages_netloc_fetchtime_idx" btree (netloc, fetchtime)
    "web_pages_netloc_id_idx" btree (netloc, id)
Foreign-key constraints:
    "web_pages_file_fkey" FOREIGN KEY (file) REFERENCES web_files(id)
Tablespace: "main_webarchive_tablespace"

Query:

EXPLAIN ANALYZE UPDATE
    web_pages
SET
    state = 'new'
WHERE
    (state = 'fetching' OR state = 'processing')
AND
    id <= 150000000;

In this case, since I have a covering index (ix_web_pages_id_state), I'd expect the query planner to do an index-only scan. Instead, it generates a bitmap heap scan, which is dramatically slower:

                                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on web_pages  (cost=524.06..532.09 rows=2 width=671) (actual time=2356.900..2356.900 rows=0 loops=1)
   ->  Bitmap Heap Scan on web_pages  (cost=524.06..532.09 rows=2 width=671) (actual time=2356.896..2356.896 rows=0 loops=1)
         Recheck Cond: (((state = 'fetching'::dlstate_enum) OR (state = 'processing'::dlstate_enum)) AND (id <= 150000000))
         Heap Blocks: exact=6
         ->  BitmapAnd  (cost=524.06..524.06 rows=2 width=0) (actual time=2353.388..2353.388 rows=0 loops=1)
               ->  BitmapOr  (cost=151.98..151.98 rows=6779 width=0) (actual time=2021.635..2021.636 rows=0 loops=1)
                     ->  Bitmap Index Scan on ix_web_pages_state  (cost=0.00..147.41 rows=6779 width=0) (actual time=2021.616..2021.617 rows=11668131 loops=1)
                           Index Cond: (state = 'fetching'::dlstate_enum)
                     ->  Bitmap Index Scan on ix_web_pages_state  (cost=0.00..4.57 rows=1 width=0) (actual time=0.015..0.016 rows=0 loops=1)
                           Index Cond: (state = 'processing'::dlstate_enum)
               ->  Bitmap Index Scan on web_pages_pkey  (cost=0.00..371.83 rows=16435 width=0) (actual time=0.046..0.047 rows=205 loops=1)
                     Index Cond: (id <= 150000000)
 Planning time: 0.232 ms
 Execution time: 2406.234 ms
(14 rows)

If I force it not to use a bitmap heap scan (with set enable_bitmapscan to off;), it generates a MUCH faster plan:

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Update on web_pages  (cost=0.56..38591.75 rows=2 width=671) (actual time=0.284..0.285 rows=0 loops=1)
   ->  Index Scan using web_pages_pkey on web_pages  (cost=0.56..38591.75 rows=2 width=671) (actual time=0.281..0.281 rows=0 loops=1)
         Index Cond: (id <= 150000000)
         Filter: ((state = 'fetching'::dlstate_enum) OR (state = 'processing'::dlstate_enum))
         Rows Removed by Filter: 181
 Planning time: 0.190 ms
 Execution time: 0.334 ms
(7 rows)

I have re-run a VACUUM ANALYZE in case the table statistics were out of date, but that didn't seem to help. Also, the plans above are from after running the same query several times, so caching shouldn't be relevant either.

How can I induce the planner to generate a more performant plan here?


Edit: As suggested in the comments, I added an index "ix_web_pages_state_id" btree (state, id). Unfortunately, it didn't help.

I've also experimented with reducing random_page_cost (to as low as 0.5) and with increasing the statistics target, neither of which had any effect.
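For reference, the experiments above looked roughly like this (the statistics target of 1000 is just one of the values I tried, shown for illustration):

```sql
-- The index suggested in the comments, with the column order reversed:
CREATE INDEX ix_web_pages_state_id ON web_pages (state, id);

-- Make random index access look cheaper relative to sequential access:
SET random_page_cost = 0.5;

-- Collect finer-grained statistics on the state column, then re-analyze:
ALTER TABLE web_pages ALTER COLUMN state SET STATISTICS 1000;
ANALYZE web_pages;
```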


Further edit – Removing the OR condition:

EXPLAIN ANALYZE UPDATE
    web_pages
SET
    state = 'new'
WHERE
    state = 'fetching'
AND
    id <= 150000000;

Yields:

                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on web_pages  (cost=311.83..315.84 rows=1 width=589) (actual time=2574.654..2574.655 rows=0 loops=1)
   ->  Bitmap Heap Scan on web_pages  (cost=311.83..315.84 rows=1 width=589) (actual time=2574.650..2574.651 rows=0 loops=1)
         Recheck Cond: ((id <= 150000000) AND (state = 'fetching'::dlstate_enum))
         Heap Blocks: exact=6
         ->  BitmapAnd  (cost=311.83..311.83 rows=1 width=0) (actual time=2574.556..2574.556 rows=0 loops=1)
               ->  Bitmap Index Scan on web_pages_pkey  (cost=0.00..49.60 rows=1205 width=0) (actual time=0.679..0.680 rows=726 loops=1)
                     Index Cond: (id <= 150000000)
               ->  Bitmap Index Scan on ix_web_pages_state  (cost=0.00..261.98 rows=7122 width=0) (actual time=2519.950..2519.951 rows=11873888 loops=1)
                     Index Cond: (state = 'fetching'::dlstate_enum)

Further edit – MOAR WEIRDNESS:

I rewrote the query to use a subquery:

EXPLAIN ANALYZE UPDATE
    web_pages
SET
    state = 'new'
WHERE
    (state = 'fetching' OR state = 'processing')
AND
    id IN (
        SELECT 
            id 
        FROM 
            web_pages 
        WHERE 
            id <= 150000000
    );

and that yields an execution plan which, sometimes, outperforms all the others so far:

                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on web_pages  (cost=1.12..13878.31 rows=1 width=595) (actual time=2.773..2.774 rows=0 loops=1)
   ->  Nested Loop  (cost=1.12..13878.31 rows=1 width=595) (actual time=2.772..2.773 rows=0 loops=1)
         ->  Index Scan using web_pages_pkey on web_pages web_pages_1  (cost=0.56..3533.34 rows=1205 width=14) (actual time=0.000..0.602 rows=181 loops=1)
               Index Cond: (id <= 150000000)
         ->  Index Scan using web_pages_pkey on web_pages  (cost=0.56..8.58 rows=1 width=585) (actual time=0.010..0.010 rows=0 loops=181)
               Index Cond: (id = web_pages_1.id)
               Filter: ((state = 'fetching'::dlstate_enum) OR (state = 'processing'::dlstate_enum))
               Rows Removed by Filter: 1
 Planning time: 0.891 ms
 Execution time: 2.894 ms
(10 rows)

At other times, though, it picks this much more expensive plan:

Update on web_pages  (cost=21193.19..48917.78 rows=2 width=595)
  ->  Hash Semi Join  (cost=21193.19..48917.78 rows=2 width=595)
        Hash Cond: (web_pages.id = web_pages_1.id)
        ->  Bitmap Heap Scan on web_pages  (cost=270.14..27976.00 rows=7126 width=585)
              Recheck Cond: ((state = 'fetching'::dlstate_enum) OR (state = 'processing'::dlstate_enum))
              ->  BitmapOr  (cost=270.14..270.14 rows=7126 width=0)
                    ->  Bitmap Index Scan on ix_web_pages_state  (cost=0.00..262.01 rows=7126 width=0)
                          Index Cond: (state = 'fetching'::dlstate_enum)
                    ->  Bitmap Index Scan on ix_web_pages_state  (cost=0.00..4.57 rows=1 width=0)
                          Index Cond: (state = 'processing'::dlstate_enum)
        ->  Hash  (cost=20834.15..20834.15 rows=7112 width=14)
              ->  Index Scan using web_pages_pkey on web_pages web_pages_1  (cost=0.56..20834.15 rows=7112 width=14)
                    Index Cond: ((id > 1883250000) AND (id <= 1883300000))

I have no idea what's going on at this point. All I know is that every case is fixed by set enable_bitmapscan to off;.
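As a stopgap, the override can at least be scoped to a single transaction with SET LOCAL, so the rest of the session keeps the default planner behaviour. A minimal sketch:

```sql
BEGIN;
-- Only this transaction sees bitmap scans as disabled.
SET LOCAL enable_bitmapscan TO off;
UPDATE web_pages
SET state = 'new'
WHERE (state = 'fetching' OR state = 'processing')
  AND id <= 150000000;
COMMIT;
```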


OK, the extremely long-running transaction I had open finished last night, and I managed to run a VACUUM VERBOSE ANALYZE on the table:

webarchive=# VACUUM ANALYZE VERBOSE web_pages;
INFO:  vacuuming "public.web_pages"
INFO:  scanned index "ix_web_pages_distance" to remove 33328301 row versions
DETAIL:  CPU 6.85s/21.21u sec elapsed 171.28 sec
INFO:  scanned index "ix_web_pages_fetchtime" to remove 33328301 row versions
DETAIL:  CPU 6.20s/25.28u sec elapsed 186.53 sec
INFO:  scanned index "ix_web_pages_id" to remove 33328301 row versions
DETAIL:  CPU 7.37s/29.56u sec elapsed 226.49 sec
INFO:  scanned index "ix_web_pages_netloc" to remove 33328301 row versions
DETAIL:  CPU 8.47s/41.44u sec elapsed 260.50 sec
INFO:  scanned index "ix_web_pages_priority" to remove 33328301 row versions
DETAIL:  CPU 5.65s/16.35u sec elapsed 180.78 sec
INFO:  scanned index "ix_web_pages_state" to remove 33328301 row versions
DETAIL:  CPU 4.51s/21.14u sec elapsed 189.60 sec
INFO:  scanned index "ix_web_pages_url" to remove 33328301 row versions
DETAIL:  CPU 26.59s/78.52u sec elapsed 969.99 sec
INFO:  scanned index "web_pages_netloc_fetchtime_idx" to remove 33328301 row versions
DETAIL:  CPU 8.23s/48.39u sec elapsed 301.37 sec
INFO:  scanned index "web_pages_netloc_id_idx" to remove 33328301 row versions
DETAIL:  CPU 15.52s/43.25u sec elapsed 423.68 sec
INFO:  scanned index "web_pages_pkey" to remove 33328301 row versions
DETAIL:  CPU 8.12s/33.43u sec elapsed 215.93 sec
INFO:  scanned index "ix_web_pages_id_state" to remove 33328301 row versions
DETAIL:  CPU 8.22s/33.26u sec elapsed 214.43 sec
INFO:  scanned index "ix_web_pages_state_id" to remove 33328301 row versions
DETAIL:  CPU 6.01s/28.04u sec elapsed 174.19 sec
INFO:  "web_pages": removed 33328301 row versions in 3408348 pages
DETAIL:  CPU 89.90s/50.24u sec elapsed 1928.70 sec
INFO:  index "ix_web_pages_distance" now contains 29463963 row versions in 215671 pages
DETAIL:  33328301 index row versions were removed.
32914 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "ix_web_pages_fetchtime" now contains 29463982 row versions in 253375 pages
DETAIL:  33328301 index row versions were removed.
40460 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "ix_web_pages_id" now contains 29464000 row versions in 238212 pages
DETAIL:  33328301 index row versions were removed.
21081 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "ix_web_pages_netloc" now contains 29464025 row versions in 358150 pages
DETAIL:  33328301 index row versions were removed.
99235 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "ix_web_pages_priority" now contains 29464032 row versions in 214923 pages
DETAIL:  33328301 index row versions were removed.
21451 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "ix_web_pages_state" now contains 29466359 row versions in 215150 pages
DETAIL:  33328301 index row versions were removed.
81340 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "ix_web_pages_url" now contains 29466350 row versions in 1137027 pages
DETAIL:  33197635 index row versions were removed.
236405 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "web_pages_netloc_fetchtime_idx" now contains 29466381 row versions in 539255 pages
DETAIL:  33328301 index row versions were removed.
220594 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "web_pages_netloc_id_idx" now contains 29466392 row versions in 501276 pages
DETAIL:  33328301 index row versions were removed.
144217 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "web_pages_pkey" now contains 29466394 row versions in 236560 pages
DETAIL:  33173411 index row versions were removed.
20559 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "ix_web_pages_id_state" now contains 29466415 row versions in 256699 pages
DETAIL:  33328301 index row versions were removed.
27194 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "ix_web_pages_state_id" now contains 29466435 row versions in 244076 pages
DETAIL:  33328301 index row versions were removed.
91918 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "web_pages": found 33339704 removable, 29367176 nonremovable row versions in 4224021 out of 4231795 pages
DETAIL:  2541 dead row versions cannot be removed yet.
There were 2079389 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 330.54s/537.34u sec elapsed 7707.90 sec.
INFO:  vacuuming "pg_toast.pg_toast_705758310"
INFO:  scanned index "pg_toast_705758310_index" to remove 7184381 row versions
DETAIL:  CPU 7.32s/13.70u sec elapsed 240.71 sec
INFO:  "pg_toast_705758310": removed 7184381 row versions in 2271192 pages
DETAIL:  CPU 62.81s/46.41u sec elapsed 1416.12 sec
INFO:  index "pg_toast_705758310_index" now contains 114558558 row versions in 338256 pages
DETAIL:  7184381 index row versions were removed.
2033 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_toast_705758310": found 7184381 removable, 40907769 nonremovable row versions in 11388831 out of 29033065 pages
DETAIL:  5 dead row versions cannot be removed yet.
There were 74209 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 433.26s/247.73u sec elapsed 8444.85 sec.
INFO:  analyzing "public.web_pages"
INFO:  "web_pages": scanned 600000 of 4232727 pages, containing 4191579 live rows and 4552 dead rows; 600000 rows in sample, 29569683 estimated total rows
VACUUM

It's still generating the non-index-only plan, though the execution time is now much more in line with the index-only one. I don't understand why the behaviour has changed so much. Does having a very long-running query cause that much overhead?
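For what it's worth, this is the sort of check I could have run while the long transaction was still open, to see what was holding vacuum back (a sketch using the standard pg_stat_activity view):

```sql
-- Oldest open transactions; an old xact here forces VACUUM to keep
-- dead tuples that the transaction's snapshot might still need to see.
SELECT pid, state, now() - xact_start AS xact_age, left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 5;
```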

webarchive=# EXPLAIN ANALYZE UPDATE
        web_pages
    SET
        state = 'new'
    WHERE
        (state = 'fetching' OR state = 'processing')
    AND
        id IN (
            SELECT
                id
            FROM
                web_pages
            WHERE
                id > 1883250000
            AND
                id <= 1883300000
        );
                                                                         QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on web_pages  (cost=36.00..9936.00 rows=1 width=594) (actual time=37.856..37.857 rows=0 loops=1)
   ->  Nested Loop Semi Join  (cost=36.00..9936.00 rows=1 width=594) (actual time=37.852..37.853 rows=0 loops=1)
         ->  Bitmap Heap Scan on web_pages  (cost=35.44..3167.00 rows=788 width=584) (actual time=23.984..31.489 rows=2321 loops=1)
               Recheck Cond: ((state = 'fetching'::dlstate_enum) OR (state = 'processing'::dlstate_enum))
               Heap Blocks: exact=2009
               ->  BitmapOr  (cost=35.44..35.44 rows=788 width=0) (actual time=22.347..22.348 rows=0 loops=1)
                     ->  Bitmap Index Scan on ix_web_pages_state  (cost=0.00..30.47 rows=788 width=0) (actual time=22.326..22.327 rows=9202 loops=1)
                           Index Cond: (state = 'fetching'::dlstate_enum)
                     ->  Bitmap Index Scan on ix_web_pages_state_id  (cost=0.00..4.57 rows=1 width=0) (actual time=0.017..0.017 rows=0 loops=1)
                           Index Cond: (state = 'processing'::dlstate_enum)
         ->  Index Scan using ix_web_pages_id_state on web_pages web_pages_1  (cost=0.56..8.58 rows=1 width=14) (actual time=0.001..0.001 rows=0 loops=2321)
               Index Cond: ((id = web_pages.id) AND (id > 1883250000) AND (id <= 1883300000))
 Planning time: 2.677 ms
 Execution time: 37.945 ms
(14 rows)

Interestingly, the value of the ID offset seems to affect the planning:

webarchive=# EXPLAIN ANALYZE UPDATE
        web_pages
    SET
        state = 'new'
    WHERE
        (state = 'fetching' OR state = 'processing')
    AND
        id IN (
            SELECT
                id
            FROM
                web_pages
            WHERE
                id >  149950000
            AND
                id <= 150000000
        );
                                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on web_pages  (cost=1.12..17.18 rows=1 width=594) (actual time=0.030..0.031 rows=0 loops=1)
   ->  Nested Loop  (cost=1.12..17.18 rows=1 width=594) (actual time=0.026..0.028 rows=0 loops=1)
         ->  Index Scan using ix_web_pages_id_state on web_pages web_pages_1  (cost=0.56..8.58 rows=1 width=14) (actual time=0.022..0.024 rows=0 loops=1)
               Index Cond: ((id > 149950000) AND (id <= 150000000))
         ->  Index Scan using ix_web_pages_id_state on web_pages  (cost=0.56..8.59 rows=1 width=584) (never executed)
               Index Cond: (id = web_pages_1.id)
               Filter: ((state = 'fetching'::dlstate_enum) OR (state = 'processing'::dlstate_enum))
 Planning time: 1.531 ms
 Execution time: 0.155 ms
(9 rows)

Does the query planner take the values of the query's parameters into account when planning? I'd have thought planning would be agnostic of parameter values, but now that I consider it, using them to improve the plan makes sense, so I can see it working that way.
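One way to see this in action (a sketch; in 9.6 the first few executions of a prepared statement are planned with the actual parameter value, so different values can produce different plans) is to compare EXPLAIN output for the same prepared statement with different arguments:

```sql
PREPARE upd(bigint) AS
  UPDATE web_pages
  SET state = 'new'
  WHERE (state = 'fetching' OR state = 'processing')
    AND id <= $1;

-- Custom plans are built using the actual value of $1:
EXPLAIN EXECUTE upd(150000000);
EXPLAIN EXECUTE upd(1883300000);
```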

Interestingly enough, the bitmap scan seems to be far more performant now:

webarchive=# set enable_bitmapscan to off;
SET
webarchive=#     EXPLAIN ANALYZE UPDATE
        web_pages
    SET
        state = 'new'
    WHERE
        (state = 'fetching' OR state = 'processing')
    AND
        id IN (
            SELECT
                id
            FROM
                web_pages
            WHERE
                id > 1883250000
            AND
                id <= 1883300000
        );
                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on web_pages  (cost=1.12..82226.59 rows=1 width=594) (actual time=66.993..66.994 rows=0 loops=1)
   ->  Nested Loop  (cost=1.12..82226.59 rows=1 width=594) (actual time=66.992..66.993 rows=0 loops=1)
         ->  Index Scan using web_pages_pkey on web_pages web_pages_1  (cost=0.56..21082.82 rows=7166 width=14) (actual time=0.055..20.206 rows=8567 loops=1)
               Index Cond: ((id > 1883250000) AND (id <= 1883300000))
         ->  Index Scan using web_pages_pkey on web_pages  (cost=0.56..8.52 rows=1 width=584) (actual time=0.004..0.004 rows=0 loops=8567)
               Index Cond: (id = web_pages_1.id)
               Filter: ((state = 'fetching'::dlstate_enum) OR (state = 'processing'::dlstate_enum))
               Rows Removed by Filter: 1
 Planning time: 1.963 ms
 Execution time: 67.112 ms
(10 rows)

webarchive=# set enable_bitmapscan to on;
SET
webarchive=#     EXPLAIN ANALYZE UPDATE
        web_pages
    SET
        state = 'new'
    WHERE
        (state = 'fetching' OR state = 'processing')
    AND
        id IN (
            SELECT
                id
            FROM
                web_pages
            WHERE
                id > 1883250000
            AND
                id <= 1883300000
        );
                                                                         QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on web_pages  (cost=36.00..9936.00 rows=1 width=594) (actual time=23.331..23.331 rows=0 loops=1)
   ->  Nested Loop Semi Join  (cost=36.00..9936.00 rows=1 width=594) (actual time=23.327..23.328 rows=0 loops=1)
         ->  Bitmap Heap Scan on web_pages  (cost=35.44..3167.00 rows=788 width=584) (actual time=6.727..17.027 rows=1966 loops=1)
               Recheck Cond: ((state = 'fetching'::dlstate_enum) OR (state = 'processing'::dlstate_enum))
               Heap Blocks: exact=3825
               ->  BitmapOr  (cost=35.44..35.44 rows=788 width=0) (actual time=3.499..3.499 rows=0 loops=1)
                     ->  Bitmap Index Scan on ix_web_pages_state  (cost=0.00..30.47 rows=788 width=0) (actual time=3.471..3.472 rows=21996 loops=1)
                           Index Cond: (state = 'fetching'::dlstate_enum)
                     ->  Bitmap Index Scan on ix_web_pages_state_id  (cost=0.00..4.57 rows=1 width=0) (actual time=0.022..0.023 rows=0 loops=1)
                           Index Cond: (state = 'processing'::dlstate_enum)
         ->  Index Scan using ix_web_pages_id_state on web_pages web_pages_1  (cost=0.56..8.58 rows=1 width=14) (actual time=0.001..0.001 rows=0 loops=1966)
               Index Cond: ((id = web_pages.id) AND (id > 1883250000) AND (id <= 1883300000))
 Planning time: 0.774 ms
 Execution time: 23.425 ms
(14 rows)

So I think the issue was just that the index contained LOTS of rows that were no longer valid, and filtering those out was the primary time cost. The underlying issue here is (I think) the way the MVCC system interacts with VACUUM in the context of extremely long-running transactions.

It would make sense (in retrospect) that entries cannot be removed from an index until every single transaction that could use that index has completed. From the documentation:

But there is an additional requirement for any table scan in
PostgreSQL: it must verify that each retrieved row be "visible" to the
query's MVCC snapshot, as discussed in Chapter 13. Visibility
information is not stored in index entries, only in heap entries; so
at first glance it would seem that every row retrieval would require a
heap access anyway. And this is indeed the case, if the table row has
been modified recently. However, for seldom-changing data there is a
way around this problem.

In this case, I started a DB dump and then went on to do a bunch of cleanup (which involved a LOT of row churn). That would lead to lots of heap lookups for each index query, since the index contained lots of now-deleted rows.
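If that hypothesis is right, the damage should show up in the standard statistics views, something like this (columns are from pg_stat_user_tables and pg_class):

```sql
-- Dead-tuple counts and vacuum history for the table:
SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'web_pages';

-- How much of the table the visibility map marks all-visible; a low
-- relallvisible relative to relpages makes index-only scans ineffective:
SELECT relpages, relallvisible
FROM pg_class
WHERE relname = 'web_pages';
```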

This is mostly hypothetical, though, since I don't have the resources to try to recreate the situation.

Anyways, @jjanes's hint about long-running queries was the key to finding my way down the rabbit hole here.

Best Answer

This part is clearly deranged:

->  Bitmap Index Scan on ix_web_pages_state  
     (cost=0.00..147.41 rows=6779 width=0) 
     (actual time=2021.616..2021.617 rows=11668131 loops=1)
Index Cond: (state = 'fetching'::dlstate_enum)

It is finding over a thousand times more rows than it expects. A VACUUM ANALYZE should have fixed this. Perhaps your vacuum didn't accomplish much because some long-open transaction is preventing it from removing dead tuples?

Doing VACUUM VERBOSE ANALYZE could shed some light on that.

You can also explore this further by using the test query:

EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM web_pages WHERE state = 'fetching';

Try to get it to run as both a bitmap scan, and regular index scan by fiddling with enable_* parameters, so we can compare the two reports. In particular, a held-open transaction will lead to the actual count for the bitmap index scan node being higher, as it must count rows before testing for visibility, while the ordinary index scan will discard invisible rows before increment the count.