PostgreSQL – Postgres prefers slow Seq Scan over fast Index Scan

cardinality-estimates, execution-plan, postgresql, postgresql-9.6, postgresql-performance

Tables

I have the following two tables.

books, which contains ~5 million rows:

                                  Table "public.books"
 Column  |           Type           |                     Modifiers
---------+--------------------------+----------------------------------------------------
 id      | integer                  | not null default nextval('books_id_seq'::regclass)
 run__id | integer                  |
 time    | timestamp with time zone | not null
 venue   | character varying        | not null
 base    | character varying        | not null
 quote   | character varying        | not null
Indexes:
    "books_pkey" PRIMARY KEY, btree (id)
    "run_books_index" UNIQUE, btree (run__id, id)
Foreign-key constraints:
    "books_run__id_fkey" FOREIGN KEY (run__id) REFERENCES runs(id)
Referenced by:
    TABLE "orders" CONSTRAINT "orders_book__id_fkey" FOREIGN KEY (book__id) REFERENCES books(id)

and orders, which contains ~3 billion rows:

          Table "public.orders"
  Column  |       Type       | Modifiers
----------+------------------+-----------
 book__id | integer          | not null
 is_buy   | boolean          | not null
 qty      | double precision | not null
 price    | double precision | not null
Indexes:
    "orders_pkey" PRIMARY KEY, btree (book__id, is_buy, price)
Foreign-key constraints:
    "orders_book__id_fkey" FOREIGN KEY (book__id) REFERENCES books(id)

Query

I want to run the following query:

SELECT * FROM books b JOIN orders o ON o.book__id = b.id WHERE b.run__id = 1

Postgres suggests the following execution plan:

orderbooks=> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM books b JOIN orders o ON o.book__id = b.id WHERE b.run__id = 1;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=2465.94..58122020.57 rows=762939 width=53) (actual time=1394110.723..2561879.775 rows=45216 loops=1)
   Hash Cond: (o.book__id = b.id)
   Buffers: shared hit=4080 read=18437586
   ->  Seq Scan on orders o  (cost=0.00..47292761.72 rows=2885110272 width=21) (actual time=0.018..2265529.184 rows=2883798728 loops=1)
         Buffers: shared hit=4073 read=18437586
   ->  Hash  (cost=2448.52..2448.52 rows=1393 width=32) (actual time=0.024..0.024 rows=15 loops=1)
         Buckets: 2048  Batches: 1  Memory Usage: 17kB
         Buffers: shared hit=4
         ->  Index Scan using run_books_index on books b  (cost=0.43..2448.52 rows=1393 width=32) (actual time=0.011..0.012 rows=15 loops=1)
               Index Cond: (run__id = 1)
               Buffers: shared hit=4
 Planning time: 2.228 ms
 Execution time: 2561882.272 ms
(13 rows)

I.e. it sequentially scans through the ~3 billion rows in orders, which takes ~40 minutes.

If I disable sequential scans, Postgres suggests the following, much faster, plan:

orderbooks=> SET enable_seqscan = OFF;
SET
orderbooks=> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM books b JOIN orders o ON o.book__id = b.id WHERE b.run__id = 1;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1.14..219271454.14 rows=762939 width=53) (actual time=0.024..15.234 rows=45216 loops=1)
   Buffers: shared hit=707
   ->  Index Scan using run_books_index on books b  (cost=0.43..2448.52 rows=1393 width=32) (actual time=0.011..0.014 rows=15 loops=1)
         Index Cond: (run__id = 1)
         Buffers: shared hit=4
   ->  Index Scan using orders_pkey on orders o  (cost=0.70..156529.57 rows=87819 width=21) (actual time=0.007..0.551 rows=3014 loops=15)
         Index Cond: (book__id = b.id)
         Buffers: shared hit=703
 Planning time: 0.302 ms
 Execution time: 18.054 ms
(10 rows)

Question

Why does Postgres prefer the slow execution plan? I see that its estimated cost is lower than that of the fast plan, but how can that be?

Notes

  1. I ran VACUUM (VERBOSE, ANALYZE) shortly before trying to execute/analyze the queries.

  2. If I don't request the column orders.qty in the output, then Postgres chooses the fast plan without my needing to disable sequential scans (presumably because all of the requested orders columns are present in orders_pkey, allowing an index-only scan):

orderbooks=> EXPLAIN SELECT b.id, b.run__id, b.time, b.venue, b.base, b.quote, o.book__id, o.is_buy, o.price FROM books b JOIN orders o ON o.book__id = b.id WHERE b.run__id = 1;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Nested Loop  (cost=1.14..6473136.93 rows=762939 width=45)
   ->  Index Scan using run_books_index on books b  (cost=0.43..2448.52 rows=1393 width=32)
         Index Cond: (run__id = 1)
   ->  Index Only Scan using orders_pkey on orders o  (cost=0.70..3766.96 rows=87819 width=13)
         Index Cond: (book__id = b.id)
(5 rows)

Domain

The data structure I'm trying to model is a limit order book. A row in the books table contains information about a single order book, while a row in the orders table describes a single order present in the given book.

I have chosen not to allow multiple orders at the same price on the same side of a given order book, hence the (book__id, is_buy, price) primary key on orders. The reason is that I don't need to distinguish between different orders (e.g. from different people) at the same price. So if my input data, for a given order book, contains two same-priced orders on the same side, I simply merge them into a single order whose quantity is the sum of the two quantities, as sketched below.
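
A minimal sketch of that merge step, assuming the raw input has first been loaded into a hypothetical staging table raw_orders with the same columns as orders:

-- raw_orders is a hypothetical staging table, not part of the schema above.
-- Orders at the same price on the same side of the same book are collapsed
-- into a single row by summing their quantities.
INSERT INTO orders (book__id, is_buy, qty, price)
SELECT book__id, is_buy, SUM(qty) AS qty, price
FROM raw_orders
GROUP BY book__id, is_buy, price;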

Versions

PostgreSQL 9.6.19 on x86_64-pc-linux-gnu, compiled by Debian clang version 10.0.1, 64-bit

Best Answer

There seem to be several things off:

  • The estimates are wrong. Since you already ANALYZEd both tables, retry with a higher default_statistics_target (see the sketch after this list).

  • Since index scans seem to be estimated as too expensive, perhaps your setting of random_page_cost is too high. Also, perhaps effective_cache_size is too low (again, see the sketch below).
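
For both points, a rough sketch follows; the concrete values are only guesses and need tuning for the actual hardware and data:

-- Collect finer-grained statistics so the row estimates improve
-- (500 is an illustrative value; the default is 100).
SET default_statistics_target = 500;
ANALYZE books;
ANALYZE orders;

-- Make random I/O look cheaper to the planner (reasonable on SSDs) and tell it
-- how much of the data is likely to already be cached.
SET random_page_cost = 1.1;         -- default is 4.0
SET effective_cache_size = '16GB';  -- a guess; size it to the RAM available for caching

If these session-level settings produce the fast plan, they can be made permanent in postgresql.conf or via ALTER SYSTEM.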

If you didn't select everything from books, a covering index might be an idea...
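
The question's Note 2 hints at the same idea on the orders side: if qty were part of the index, that half of the join could be answered by an index-only scan even when qty is selected. A sketch, untested and with an arbitrary index name; on 9.6 the extra column has to go into the b-tree key itself, since INCLUDE columns only arrived in PostgreSQL 11:

-- Untested sketch: widen the key with qty so the whole orders side of the query
-- can be read from the index alone (index-only scan), at the cost of a bigger index.
CREATE INDEX orders_covering_idx ON orders (book__id, is_buy, price, qty);

On a table with ~3 billion rows the index will be large, and building it with CREATE INDEX CONCURRENTLY avoids blocking writes while it is built.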