We have an integer column that currently consists only of 0 or 1 values. This column has now been used by a developer to store a unique 32-bit identifier on some occasions, and we need to be able to efficiently pull out rows containing any one of these identifiers.
Given the value will be 0 or 1 say (I don't have figures yet) 99% of the time, how might it best be indexed to query against the minority case? Am I even right in thinking the volume of common values will be an issue?
Column | Type | Modifiers
----------------------------+---------+--------------------
event_value | integer | not null
There are currently no indexes on this column. And I don't envisage the need to regularly select just the 0 or 1 values.
The table is of a reasonable size, currently 30 million rows and growing fast.
I appreciate this isn't the best use of the column, but that can't change in the short term.
Best Answer
First off, like you said yourself, not the best use of the column. Should be a separate
boolean
and aninteger
column for your "32-bit identifiers". If that'sNULL
99% of the time, that is no problem.NULL
storage is very cheap.Either way, you should definitely use a partial index. (That's the proper term as used in the manual.) Excluding 99 % of the rows from the index makes it massively smaller, which matters for performance with millions of rows.
However, if you have a complete index on
event_value
anyway, and your common queries are retrieving single rows like:... then an additional partial index won't buy much. It would still be used as it's still a bit faster, but not much faster than a complete index. And the costs for an additional index may outweigh the benefits.
While the rare values are "32-bit identifiers", it may be incorrect to assume those are all > 1. Postgres uses signed integer, and 32-bit entities would also cover negative numbers. (Can we even rule out
0
or1
as one of those identifiers?) If there can be negative values, too:event_value
does not have to be an index column, regardless of its use in the WHERE clause. That entirely depends on the kinds of queries to expect. Either way, the safe bet is to add the sameWHERE
conditions literally to any query supposed to use the index, even if that's logically redundant. Postgres can make very basic logical conclusions to determine applicable indexes, but it is no AI and does not try to be (would get too expensive quickly). Like:Related: