Postgres version
Using PostgreSQL 10.3.
Table definition
CREATE TABLE tickets (
id bigserial primary key,
state character varying,
closed timestamp
);
CREATE INDEX "state_index" ON "tickets" ("state")
WHERE ((state)::text = 'open'::text);
Cardinality
The table contains 1027616 rows, of which 51533 have state = 'open' and closed IS NULL, or about 5%.
A query with a condition on state performs well, using an index scan as expected:
explain analyze select * from tickets where state = 'open';
Index Scan using state_index on tickets (cost=0.29..16093.57 rows=36599 width=212) (actual time=0.025..22.875 rows=37346 loops=1)
Planning time: 0.212 ms
Execution time: 25.697 ms
I am trying to achieve the same or better performance for a query with the condition closed IS NULL, so that I can drop the state column and rely on the closed column to fetch the same rows. closed is NULL for exactly the rows where state = 'open', hence the state column is redundant.
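A quick sanity check of that equivalence could look like this (a sketch, not from the original question; IS DISTINCT FROM handles a NULL state safely):

```sql
-- Should return 0 if closed IS NULL and state = 'open' mark the same rows
SELECT count(*)
FROM tickets
WHERE (state = 'open') IS DISTINCT FROM (closed IS NULL);
```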
select * from tickets where closed IS NULL;
However, none of the indexes I've tried results in a single index scan like the first query. Below are the indexes I've tried, along with the EXPLAIN ANALYZE results.
A partial index:
CREATE INDEX "closed_index" ON "tickets" ("closed") WHERE (closed IS NULL);
explain analyze select * from tickets where closed IS NULL;
Bitmap Heap Scan on tickets (cost=604.22..38697.91 rows=36559 width=212) (actual time=12.879..48.780 rows=37348 loops=1)
Recheck Cond: (closed IS NULL)
Heap Blocks: exact=14757
-> Bitmap Index Scan on closed_index (cost=0.00..595.09 rows=36559 width=0) (actual time=7.585..7.585 rows=37348 loops=1)
Planning time: 4.831 ms
Execution time: 52.068 ms
An expression index:
CREATE INDEX "closed_index" ON "tickets" ((closed IS NULL))
explain analyze select * from tickets where closed IS NULL;
Seq Scan on tickets (cost=0.00..45228.26 rows=36559 width=212) (actual time=0.025..271.418 rows=37348 loops=1)
Filter: (closed IS NULL)
Rows Removed by Filter: 836578
Planning time: 7.992 ms
Execution time: 274.504 ms
A partial expression index:
CREATE INDEX "closed_index" ON "tickets" ((closed IS NULL))
WHERE (closed IS NULL);
explain analyze select * from tickets where closed IS NULL;
Bitmap Heap Scan on tickets (cost=604.22..38697.91 rows=36559 width=212) (actual time=177.109..238.008 rows=37348 loops=1)
Recheck Cond: (closed IS NULL)
Heap Blocks: exact=14757
-> Bitmap Index Scan on "closed_index" (cost=0.00..595.09 rows=36559 width=0) (actual time=174.598..174.598 rows=37348 loops=1)
Planning time: 23.063 ms
Execution time: 241.292 ms
UPDATED
Expanded table definition:
CREATE TABLE tickets (
id bigserial primary key,
state character varying,
closed timestamp,
created timestamp,
updated timestamp,
title character varying,
size integer NOT NULL,
comment_count integer NOT NULL
);
CREATE INDEX "state_index" ON "tickets" ("state")
WHERE ((state)::text = 'open'::text);
Cardinality:
The table contains 1027616 rows, with 51533 of them having state = 'open' and closed IS NULL, or about 5%. As mentioned above, I am trying to drop the state column, so I want to fetch the same rows using a condition on the closed column instead.
A query with a condition on the state column uses an index scan:
explain analyze select id, title, created, closed, updated from tickets where state = 'open';
Index Scan using state_index on tickets (cost=0.29..22901.58 rows=49356 width=72) (actual time=0.107..49.599 rows=51533 loops=1)
Planning time: 0.511 ms
Execution time: 54.366 ms
I'd like the same performance (an index scan, ideally) when switching to querying on the closed column. I tried a partial index on id with a closed IS NULL predicate:
CREATE INDEX closed_index ON tickets (id) WHERE closed IS NULL;
VACUUM ANALYZE tickets;
explain analyze select id, title, created, closed, updated from tickets where closed IS NULL;
Bitmap Heap Scan on tickets (cost=811.96..33999.42 rows=49461 width=72) (actual time=7.868..47.080 rows=51537 loops=1)
Recheck Cond: (closed IS NULL)
Heap Blocks: exact=17479
-> Bitmap Index Scan on closed_index (cost=0.00..799.60 rows=49461 width=0) (actual time=4.868..4.868 rows=51537 loops=1)
Planning time: 0.222 ms
Execution time: 51.028 ms
Best Answer
Assuming the central piece of information is supposed to mean that the same 15 % of all 1031584 rows meet both these conditions (all details matter!), both conditions should perform equally, returning around 155k rows (!).
Your query plans show 37346 qualifying rows, ~ 3.6 %, not 15 %. Something's still not right in your question.
With 3.6 %, indexes only start to make sense. With your tiny row size of ~ 52 bytes per row, around 155 rows fit per page, which would mean 5-6 hits per page for a completely random distribution. Postgres has to read all pages anyway, so a sequential scan should be the fastest plan; filtering out the misses should be cheaper than involving indexes in any way.
Typically, qualifying rows are more or less clustered, and the fewer data pages are involved, the more sense it makes to involve an index. Bitmap index scans at best, though; I see hardly any case for a plain index scan, and even far less (as far as there can be much "less" than "hardly any") for the 15 % you claim.
For your updated numbers (~ 5 % of all rows match) I would still rather expect bitmap index scans than index scans. A possible explanation: table bloat with lots of dead tuples. You mentioned a high write load. That would result in fewer and fewer live tuples per data page and favor index scans (as compared to bitmap index scans). You might re-test your initial query after a VACUUM FULL ANALYZE (if you can afford an exclusive lock on the table!). If my hypothesis holds, the physical table size would shrink substantially and you would then see a bitmap index scan instead of the index scan (and a faster one, too). You may need more aggressive autovacuum settings.
Partial index
Your "expression index" and "partial expression index" are not useful. We don't need closed IS NULL as the actual index expression (it is always true here); the expression only adds cost and no gain.
The first, plain partial index is the more useful variant. But don't use closed as the index column (again, always NULL here). Instead, use any column possibly useful to other queries, and ideally one that is never updated, to avoid additional cost and index bloat. The primary key column id is the natural candidate in the absence of other useful applications.
Or, if id is not useful, consider a constant instead. That makes the actual index column as useless as in the other dismissed variants, but it avoids all additional costs and dependencies.
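The two suggested variants could be sketched like this (index names are illustrative, not from the original answer):

```sql
-- Variant 1: partial index on the primary key column
CREATE INDEX tickets_open_id_idx ON tickets (id)
WHERE closed IS NULL;

-- Variant 2: partial index on a constant, if id is not useful elsewhere
CREATE INDEX tickets_open_const_idx ON tickets ((true))
WHERE closed IS NULL;
```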
What I might try:
Updated for your updated question. This only makes sense if you don't have many other writes on the rows in question (the added columns updated and comment_count make me doubt it): create a partial index with id and the other relevant columns (few & small) as index columns, and capitalize on it with a suitable query to get index-only scans.
We don't need SELECT *, and closed IS NULL is given by the WHERE clause, so we can serve the query from the tiny partial index in a fast index-only scan - assuming you meet the preconditions (that's why I threw in VACUUM to update the visibility map above). This is the rare case where a query reading more than ~ 5 % of all rows still happily uses the index (even up to reading the whole table). There also seems to be redundancy in your design; simplifications should be possible.
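A sketch of such a covering partial index and a matching query (the index name and column choice are assumptions based on the question's expanded table definition; closed can be reconstructed as NULL in the target list because the predicate guarantees it):

```sql
-- Covering partial index over the open tickets only
CREATE INDEX tickets_open_covering_idx
ON tickets (id, created, updated, title)
WHERE closed IS NULL;

-- Query that can be answered by an index-only scan
SELECT id, title, created, NULL::timestamp AS closed, updated
FROM tickets
WHERE closed IS NULL;
```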
This works since Postgres 9.6, which added support for index-only scans on partial indexes whose predicate involves columns not stored in the index (see the release notes). Or the information in your question is misleading.
If you don't see index-only scans with this, even immediately after running VACUUM, then a high write load may be in the way and the visibility map just never reaches a state that allows index-only scans (the manual covers the preconditions). Or you have another problem in your DB keeping VACUUM from doing its job.
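One way to check whether VACUUM/autovacuum is keeping up on this table is the standard pg_stat_user_tables view (a diagnostic sketch, not from the original answer):

```sql
-- High n_dead_tup or stale last_autovacuum suggests vacuum is falling behind
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'tickets';
```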