I have a table, account_balance
, with primary key (address, timestamp, currency)
. This table is pretty big, having nearly 1 billion rows and taking ~3T of space. I am doing a simple query on it:
SELECT address, EXISTS(SELECT address FROM account_balance WHERE address = a.address) ex
FROM (VALUES ('ab566dceac0d857dbf05682d1ceb548667cf4580')) AS a(address);
This should be a simple index lookup but instead Postges produces a seq scan. Here's explain result:
Result (cost=0.37..0.38 rows=1 width=33)
InitPlan 1 (returns $0)
-> Seq Scan on account_balance (cost=0.00..355882209.06 rows=965801108 width=0)
Filter: ((address)::text = 'ab566dceac0d857dbf05682d1ceb548667cf4580'::text)
If I set set enable_seqscan = off
then as expected:
Result (cost=0.99..1.00 rows=1 width=33)
InitPlan 1 (returns $0)
-> Index Only Scan using account_balance_pkey on account_balance (cost=0.83..153234426.09 rows=965801135 width=0)
Index Cond: (address = 'ab566dceac0d857dbf05682d1ceb548667cf4580'::text)
So the question is – why it chooses the seq scan? It should never make sense to prefer seq scan to index lookup on such a huge table. What could compel it to choose to go sequentially over a 3T table (which of course takes forever) instead of going for the index (which takes microseconds)? Is some parameter wrong or some system variable I need to check? I tried SET random_page_cost = 1
and for some cases it helps, but for others it still insists on using seq scan. Only setting enable_seqscan
to off
seems to fix it consistently, but I feel it's a wrong approach.
I am not sure how to interpret the costs numbers – I feel like they may be the key for it, but I can't figure out where they come from and why they are what they are.
This is on PostgreSQL 12.
P.S. Tried to do ANALYZE
, didn't change anything.
Best Answer
EXISTS
will stop executing as soon as a result is found, so if that happens early on, a sequential scan will be very fast. Obviously PostgreSQL estimates that very many rows inaccount_balance
have thataddress
.The cost estimate on the sequential scan is misleading, since it is for the complete scan. As you see, the final cost is estimated quite low.
Without
EXPLAIN (ANALYZE, BUFFERS)
it is impossible to say if PostgreSQL is right, but then you only asked for an explanation why it would choose such a plan.