PostgreSQL – Does It Always Sequentially Scan Pages in the Same Order?

cachepostgresql

Yet another question from my discovery of the great new EXPLAIN options in PostgreSQL. This one focuses on the BUFFERS option.

Here is EXPLAIN:

EXPLAIN (ANALYZE, BUFFERS) SELECT event_time FROM ui_events_v2  WHERE page ~ 'foo' LIMIT 1;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..1539.68 rows=1 width=8) (actual time=0.858..0.858 rows=1 loops=1)
   Buffers: shared read=10
   I/O Timings: read=0.735
   ->  Seq Scan on ui_events_v2  (cost=0.00..3313394.58 rows=2152 width=8) (actual time=0.857..0.857 rows=1 loops=1)
         Filter: (page ~ 'foo'::text)
         Rows Removed by Filter: 112
         Buffers: shared read=10
         I/O Timings: read=0.735
 Planning Time: 6.455 ms
 Execution Time: 0.877 ms

It is quite fast – and this query is very slow on a cold start. This is a 30M row table, and seven rows have page containing substring foo. The first row matching this query would seem to show up 604k pages in, which would be ~20% of the ~3M pages mentioned in pg_class:

SELECT min(ctid) FROM ui_events_v2 WHERE page ~ 'foo';
     min
-------------
 (604435,10)

My assumption was that even if every page of this table were in the PostgreSQL or operating system cache, it still needs to walk some linear list of pages for a sequential scan. The section Buffers: shared read=10, and the 0.877 ms execution time suggests to me that it's somehow "picking up where it left off" and starting at pages that are far from the beginning. I thought it moved through page IDs and peeked into a cache trying for each page as it moved, but does it perhaps start instead from the cache itself? If that were the case, how does it discover the non-cached pages?

I know that rows aren't guaranteed to be found in any particular order, but I figured that within the same query strategy, the path followed would be relatively similar.

Best Answer

If multiple processes are trying to seq scan the same table, it will try to line them all up so that all read the same pages at about the same time, to maximize cache hits. It does this by having the late comers start in the middle (where ever the existing ones happen to be at), and wrap around the end to finish the table at the page before where it started. A side effect of this is that if one seq scan stops early, the next one will start up on the same page where the last one stopped. So if the entire LIMIT can be satisfied by rows from one page, a scan like yours will just scan the same page over and over.

You can turn this optimization off with set synchronize_seqscans=off, then they all start from the first block in the table.