I've got a PostgreSQL 9.3 table with some numbers and some additional data:
CREATE TABLE mytable (
    myid     BIGINT,
    somedata BYTEA
);
This table currently has about 10M records and takes 1GB of disk space. The myid values are not consecutive.
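For anyone who wants to reproduce a table of roughly this shape, here is a sketch; the gap distribution is my assumption, since the question only states that the ids are non-consecutive:

```sql
-- Populate mytable with roughly 10M non-consecutive ids by
-- randomly skipping rows out of a larger range (assumed distribution).
INSERT INTO mytable (myid, somedata)
SELECT g, '\x00'::bytea
FROM   generate_series(1, 12000000) AS g
WHERE  random() < 0.83;   -- keeps about 10M rows, leaving gaps
```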
I want to compute how many rows are in every block of 100000 consecutive numbers:
SELECT myid/100000 AS block, count(*) AS total FROM mytable GROUP BY myid/100000;
This returns about 3500 rows.
I noticed that the existence of a certain index significantly speeds up this query, even though the query plan does not mention the index at all. The query plan without the index:
db=> EXPLAIN (ANALYZE TRUE, VERBOSE TRUE) SELECT myid/100000 AS block, count(*) AS total FROM mytable GROUP BY myid/100000;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=1636639.92..1709958.65 rows=496942 width=8) (actual time=6783.763..8888.841 rows=3460 loops=1)
Output: ((myid / 100000)), count(*)
-> Sort (cost=1636639.92..1659008.91 rows=8947594 width=8) (actual time=6783.752..8005.831 rows=8947557 loops=1)
Output: ((myid / 100000))
Sort Key: ((mytable.myid / 100000))
Sort Method: external merge Disk: 157440kB
-> Seq Scan on public.mytable (cost=0.00..236506.92 rows=8947594 width=8) (actual time=0.020..1674.838 rows=8947557 loops=1)
Output: (myid / 100000)
Total runtime: 8914.780 ms
(9 rows)
The index:
db=> CREATE INDEX myindex ON mytable ((myid/100000));
db=> VACUUM ANALYZE;
The new query plan:
db=> EXPLAIN (ANALYZE TRUE, VERBOSE TRUE) SELECT myid/100000 AS block, count(*) AS total FROM mytable GROUP BY myid/100000;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=281242.99..281285.97 rows=3439 width=8) (actual time=3190.189..3190.800 rows=3460 loops=1)
Output: ((myid / 100000)), count(*)
-> Seq Scan on public.mytable (cost=0.00..236505.56 rows=8947485 width=8) (actual time=0.026..1659.571 rows=8947557 loops=1)
Output: (myid / 100000)
Total runtime: 3190.975 ms
(5 rows)
So, the query plans and the runtimes differ significantly (by a factor of almost three), but neither plan mentions the index. This behavior is perfectly reproducible on my dev machine: I went through several cycles of dropping the index, testing the query several times, recreating the index, and testing the query again. What's happening here?
Best Answer
VACUUM ANALYZE makes the difference in your example - plus, as @jjanes supplied, the additional statistics for the functional index. Per documentation:

    However, creating the index does not by itself cause Postgres to gather statistics.

Try (statistics for an expression index show up in the pg_stats view under the index's name):

db=> SELECT * FROM pg_stats WHERE tablename = 'myindex';

This returns nothing until you run your first ANALYZE (or VACUUM ANALYZE), or until the autovacuum daemon kicks in. After that, you'll see the added statistics.
Since the whole table has to be read anyway, Postgres is going to use a sequential scan unless it expects the computation of myid/100000 to be expensive enough to warrant a switch - which it isn't.
Your only other chance would be an index-only scan - if the index is much smaller than the table, and the preconditions for index-only scans are met. Details in the Postgres Wiki and in the manual.
As long as that functional index is not used, the collateral benefit from the added statistics is moderate. If the table were read-only, the cost of keeping the index would be low - but then again, we would probably see an index-only scan right away.
Maybe you can also achieve better query plans by setting a higher statistics target for mytable.myid. That would only incur a minor cost.
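A sketch of raising the statistics target; the value 1000 is illustrative (the default is 100), and the statistics only take effect after the next ANALYZE:

```sql
-- Gather more detailed statistics for this column on subsequent ANALYZE runs.
ALTER TABLE mytable ALTER COLUMN myid SET STATISTICS 1000;
ANALYZE mytable;
```

Larger targets make ANALYZE slightly slower and the statistics slightly bigger, which is the minor cost mentioned above.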