PostgreSQL – Best Index for Column with Similar Values

indexpostgresqlpostgresql-9.6

We have an integer column that currently consists only of 0 or 1 values. This column has now been used by a developer to store a unique 32-bit identifier on some occasions, and we need to be able to efficiently pull out rows containing any one of these identifiers.

Given the value will be 0 or 1 say (I don't have figures yet) 99% of the time, how might it best be indexed to query against the minority case? Am I even right in thinking the volume of common values will be an issue?

           Column           |  Type   |     Modifiers
----------------------------+---------+--------------------
 event_value                | integer | not null

There are currently no indexes on this column. And I don't envisage the need to regularly select just the 0 or 1 values.

The table is of a reasonable size, currently 30 million rows and growing fast.

I appreciate this isn't the best use of the column, but that can't change in the short term.

Best Answer

First off, like you said yourself, not the best use of the column. Should be a separate boolean and an integer column for your "32-bit identifiers". If that's NULL 99% of the time, that is no problem. NULL storage is very cheap.

Either way, you should definitely use a partial index. (That's the proper term as used in the manual.) Excluding 99 % of the rows from the index makes it massively smaller, which matters for performance with millions of rows.

However, if you have a complete index on event_value anyway, and your common queries are retrieving single rows like:

SELECT * FROM tbl WHERE event_value = 123;

... then an additional partial index won't buy much. It would still be used as it's still a bit faster, but not much faster than a complete index. And the costs for an additional index may outweigh the benefits.

While the rare values are "32-bit identifiers", it may be incorrect to assume those are all > 1. Postgres uses signed integer, and 32-bit entities would also cover negative numbers. (Can we even rule out 0 or 1 as one of those identifiers?) If there can be negative values, too:

CREATE INDEX tbl_event_value_part_idx ON tbl (event_value)
WHERE event_value > 1 OR event_value < 0; -- or similar

event_value does not have to be an index column, regardless of its use in the WHERE clause. That entirely depends on the kinds of queries to expect. Either way, the safe bet is to add the same WHERE conditions literally to any query supposed to use the index, even if that's logically redundant. Postgres can make very basic logical conclusions to determine applicable indexes, but it is no AI and does not try to be (would get too expensive quickly). Like:

SELECT * FROM tbl WHERE event_value > 1 OR event_value < 0

Related: