In PostgreSQL 9.3, I am attempting to create an efficient index on a rarely-used (0.00001% of total records) boolean column. To that end, I discovered this post on SO: https://stackoverflow.com/a/12026593/808921
I am attempting to make use of the "partial index" feature of PostgreSQL that Erwin Brandstetter recommends. I have a table with a couple million records already, and I would like to add the index to that table, like so:
CREATE INDEX schema_defs_deprovision ON schema_defs (deprovision)
WHERE deprovision = 0;
(the vast majority of the records will have deprovision = 1
)
The problem is that when I try to use this index with the simplest of queries, PostgreSQL acts as if it isn't there:
explain select * from schema_defs where deprovision = 0;
Seq Scan on schema_defs (cost=0.00..1.05 rows=1 width=278)
Filter: (deprovision = 0)
The really strange thing is that I have found that if this index is created before there is data in the table, then it does actually work fine. Don't believe me? Here are some SQL Fiddle entries that prove it:
Partial Index Created After Inserts (Index not working)
Partial Index Created Before Inserts (Index working properly)
In both of those, just expand the "View Execution Plan" link to see what I'm talking about.
So, my question is – what do I have to do in order to get PostgreSQL to start using a partial index on a table that had data in it before the index was created?
BTW I'm also the developer of SQL Fiddle and this question is related to a new development effort I'm making for it.
Best Answer
Run
ANALYZE
after the index has been added. And make sure the columndeprovision
has statistics. How to verify?Basic statistics in
pg_class
:Data histograms per column in
pg_stats
(pg_statistics
):The manual:
In your case, analyzing just the one column would do the job:
While being at it, it makes no sense to have the index on the column
deprovision
. Given the predicateWHERE deprovision = 0
it does not carry additional information. You might as well use a constant expression:Just a proof of concept. This would not be any more useful. In this special case you wouldn't need an index column at all, but you must provide at least one column or expression. So use the primary key (since it does not change and is indexed anyway, you don't introduce more restrictions / overhead costs) or any other small column (<= 8 bytes) that might be useful for queries.
sqlfiddle.com
The demo fiddles are misleading.
Your demo table only has 4 rows. Postgres should not be using the index. A similar problem, just the other way round. Postgres does not have statistics on the table immediately after creating it - until the first run of
ANALYZE
. Then it knows there are only 4 rows and won't touch the index any more.So why does it work properly on your second demo? The manual:
Bold emphasis mine. If you create the index after inserting rows, those basic statistics in
pg_class
are updated. But only those, not the detailed statistics inpg_statistic
:To make Postgres use the partial index (in particular in it's original form that is not useful for anything else) you also need the data histogram in
pg_statistic
informing the query planner thatdeprovision = 0
is actually a rare case, so it will pay to use the index.Autovacuum takes care of this. It schedules
VACUUM
andANALYZE
automatically. But there is a time frame (depends on settings and load) between writes to the table and the nextANALYZE
run. If you run queries immediately after table creation or changes to the table, those latest changes are not reflected in the statistics, yet. Never mind if that does not change statistics in a relevant way. If it does, for instance after a largeINSERT
or immediately after table creation, runANALYZE
manually to get proper query plans.Note that temporary tables are not covered by autovacuum at all. You always need to run
ANALYZE
manually on those if you need it:I don't know how you configured autovacuum and whether / when you run
ANALYZE
manually. But I noticed in the past that sqlfiddle can be misleading due to missing / outdated statistics.I would be very interested how
ANALYZE
is handled behind the curtains on sqlfiddle. It may be best not to do anything special, but some info would be welcome. Maybe one basic webpage per available RDBMS version?Demo
I created an SQL Fiddle to demonstrate the effects of
CREATE INDEX
andANALYZE
on the various statistics.The effects show (at least) on the first run for me. May not be reproducible on later runs, you would have to create a new schema and run again.
First we see neither basic statistics in
pg_class
:Nor any entries for
deprovision
inpg_statistics
at all (no result).Postgres has no idea what's in the table and defaults to using the index - which is a bad choice!
After
CREATE INDEX
, we see basic statistics, but still no data histogram inpg_statistics
.After
ANALYZE
we see both.With proper statistics, Postgres now uses a sequential scan (good choice, even if there is an index - it would be more expensive for so few rows).