Postgres – Does Updating a Partially Indexed Field Affect Index?

indexindex-tuningperformancepostgresql

We can create a partial index for a field in Postgres.

I am looking at using this to stop rows which are archived from being in an index on a created_at field:

  1. First we create the index:

    test=# CREATE INDEX idx_my_table_created_at ON my_table(created_at) WHERE NOT(archived);
    
  2. We insert two rows, one which gets in idx_my_table_created_at and one which does not:

     id |         created_at         | archived
    ----+----------------------------+----------
      2 | 2012-10-12 17:39:17.28511  | t
      1 | 2012-10-12 17:38:55.531278 | f
    
  3. Now we UPDATE a row which is presently in the index, such that it shouldn't be in the index:

    test=# UPDATE my_table SET archived = true WHERE id = 1;
    

My question: Is id = 1 still in idx_my_table_created_at?

This question follows on from an answer to a previous question I asked on how to handle rows which won't be queried once they are old, hence archived.

Best Answer

(Note: I'm not a deep expert in the Pg guts. This is only my understanding. You should really read the mvcc and internals sections of the manual).

Yes, initially the row with id = 1 is still in the index, despite the index's WHERE clause excluding it because it has been updated with archived = true. At some point down the track the index entry will be purged by VACUUM or autovacuum, freeing the space.

The reason for this is PostgreSQL's MVCC design. Other concurrent transactions might still be able to see the old version of the row with id = 1, archived = false - either READ COMMITTED transactions with a long-running statement, or SERIALIZABLE transactions that don't see changes committed after they began. If the index entry were removed when the row was updated those transactions wouldn't find the row in index scans and would produce the wrong result.

Technically what's actually happening is that the old row still exists, it's just had an xmax set so newer transactions ignore it when they see it. After the update the index points to that row, same as before. A new row with archived = true has been inserted with an xmin above the old row's xmax, so any given transaction either sees the old or new version, never both. That new row is not added to the index.

Eventually there will be no open transaction that can still see the old row, so VACUUM will come by and remove the old row and its index entry, freeing the space for re-use.

Note that the index its self has no row visibility information. You can - and often do - have many versions of a row, each with index entries. When a transaction looks for matching index entries it reads all of the versions (unless it can tell from the visibility map that it doesn't need to) and ignores all but the one that has an xmin and xmax that makes it visible to that transaction.

This is one of many good reasons to ensure that autovaccum runs frequently. It reduces the number of dead index entries and dead rows you have lying around wasting space and I/O.