PostgreSQL Indexing – Partial Index Unused When Created on a Table with Existing Data

execution-planindexpostgresqlstatistics

In PostgreSQL 9.3, I am attempting to create an efficient index on a rarely-used (0.00001% of total records) boolean column. To that end, I discovered this post on SO: https://stackoverflow.com/a/12026593/808921

I am attempting to make use of the "partial index" feature of PostgreSQL that Erwin Brandstetter recommends. I have a table with a couple million records already, and I would like to add the index to that table, like so:

CREATE INDEX schema_defs_deprovision ON schema_defs (deprovision) 
WHERE deprovision = 0;

(the vast majority of the records will have deprovision = 1)

The problem is that when I try to use this index with the simplest of queries, PostgreSQL acts as if it isn't there:

explain select * from schema_defs where deprovision = 0;

Seq Scan on schema_defs (cost=0.00..1.05 rows=1 width=278)
Filter: (deprovision = 0)

The really strange thing is that I have found that if this index is created before there is data in the table, then it does actually work fine. Don't believe me? Here are some SQL Fiddle entries that prove it:

Partial Index Created After Inserts (Index not working)

Partial Index Created Before Inserts (Index working properly)

In both of those, just expand the "View Execution Plan" link to see what I'm talking about.

So, my question is – what do I have to do in order to get PostgreSQL to start using a partial index on a table that had data in it before the index was created?

BTW I'm also the developer of SQL Fiddle and this question is related to a new development effort I'm making for it.

Best Answer

Run ANALYZE after the index has been added. And make sure the column deprovision has statistics. How to verify?

Basic statistics in pg_class:

SELECT relname, relkind, reltuples, relpages
FROM   pg_class
WHERE  oid = 'schema_defs'::regclass;

Data histograms per column in pg_stats (pg_statistics):

SELECT attname, inherited, n_distinct
     , array_to_string(most_common_vals, E'\n') AS most_common_vals
FROM   pg_stats
WHERE  tablename = 'schema_defs'
AND    attname = 'deprovision';

The manual:

The PostgreSQL query planner relies on statistical information about the contents of tables in order to generate good plans for queries. These statistics are gathered by the ANALYZE command, which can be invoked by itself or as an optional step in VACUUM. It is important to have reasonably accurate statistics, otherwise poor choices of plans might degrade database performance.

The autovacuum daemon, if enabled, will automatically issue ANALYZE commands whenever the content of a table has changed sufficiently. However, administrators might prefer to rely on manually-scheduled ANALYZE operations, particularly if it is known that update activity on a table will not affect the statistics of "interesting" columns. The daemon schedules ANALYZE strictly as a function of the number of rows inserted or updated; it has no knowledge of whether that will lead to meaningful statistical changes.

In your case, analyzing just the one column would do the job:

ANALYZE table_name (deprovision);

While being at it, it makes no sense to have the index on the column deprovision. Given the predicate WHERE deprovision = 0 it does not carry additional information. You might as well use a constant expression:

CREATE INDEX schema_defs_deprovision ON schema_defs ((true)) 
WHERE deprovision = 0;

Just a proof of concept. This would not be any more useful. In this special case you wouldn't need an index column at all, but you must provide at least one column or expression. So use the primary key (since it does not change and is indexed anyway, you don't introduce more restrictions / overhead costs) or any other small column (<= 8 bytes) that might be useful for queries.

CREATE INDEX schema_defs_deprovision ON schema_defs (id) 
WHERE deprovision = 0;

sqlfiddle.com

The demo fiddles are misleading.

Partial Index Created Before Inserts (Index working properly)

Your demo table only has 4 rows. Postgres should not be using the index. A similar problem, just the other way round. Postgres does not have statistics on the table immediately after creating it - until the first run of ANALYZE. Then it knows there are only 4 rows and won't touch the index any more.
So why does it work properly on your second demo? The manual:

For efficiency reasons, reltuples and relpages are not updated on-the-fly, and so they usually contain somewhat out-of-date values. They are updated by VACUUM, ANALYZE, and a few DDL commands such as CREATE INDEX.

Bold emphasis mine. If you create the index after inserting rows, those basic statistics in pg_class are updated. But only those, not the detailed statistics in pg_statistic:

Entries in pg_statistic are updated by the ANALYZE and VACUUM ANALYZE commands, and are always approximate even when freshly updated.

To make Postgres use the partial index (in particular in it's original form that is not useful for anything else) you also need the data histogram in pg_statistic informing the query planner that deprovision = 0 is actually a rare case, so it will pay to use the index.

Autovacuum takes care of this. It schedules VACUUM and ANALYZE automatically. But there is a time frame (depends on settings and load) between writes to the table and the next ANALYZE run. If you run queries immediately after table creation or changes to the table, those latest changes are not reflected in the statistics, yet. Never mind if that does not change statistics in a relevant way. If it does, for instance after a large INSERT or immediately after table creation, run ANALYZE manually to get proper query plans.

Note that temporary tables are not covered by autovacuum at all. You always need to run ANALYZE manually on those if you need it:

I don't know how you configured autovacuum and whether / when you run ANALYZE manually. But I noticed in the past that sqlfiddle can be misleading due to missing / outdated statistics.

I would be very interested how ANALYZE is handled behind the curtains on sqlfiddle. It may be best not to do anything special, but some info would be welcome. Maybe one basic webpage per available RDBMS version?

Demo

I created an SQL Fiddle to demonstrate the effects of CREATE INDEX and ANALYZE on the various statistics.

The effects show (at least) on the first run for me. May not be reproducible on later runs, you would have to create a new schema and run again.

First we see neither basic statistics in pg_class:

relname      reltuples  relpages
schema_defs  0          0

Nor any entries for deprovision in pg_statistics at all (no result).
Postgres has no idea what's in the table and defaults to using the index - which is a bad choice!

After CREATE INDEX, we see basic statistics, but still no data histogram in pg_statistics.

After ANALYZE we see both.

With proper statistics, Postgres now uses a sequential scan (good choice, even if there is an index - it would be more expensive for so few rows).