PostgreSQL – Best Index for Column with Similar Values

Tags: index, postgresql, postgresql-9.6

We have an integer column that currently consists only of 0 or 1 values. This column has now been used by a developer to store a unique 32-bit identifier on some occasions, and we need to be able to efficiently pull out rows containing any one of these identifiers.

Given that the value will be 0 or 1 roughly 99% of the time (I don't have exact figures yet), how might it best be indexed to query for the minority case? Am I even right in thinking that the volume of common values will be an issue?

           Column           |  Type   |     Modifiers
----------------------------+---------+--------------------
 event_value                | integer | not null

There are currently no indexes on this column, and I don't envisage needing to regularly select just the 0 or 1 values.

The table is of a reasonable size, currently 30 million rows and growing fast.

I appreciate this isn't the best use of the column, but that can't change in the short term.

Best Answer

First off, as you said yourself, this is not the best use of the column. It should be split into a boolean column and a separate integer column for your "32-bit identifiers". If the integer column is NULL 99% of the time, that is no problem: NULL storage is very cheap.
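For illustration only, the split described above might look roughly like the following. The column names event_flag and event_ref and the index name are hypothetical (they do not appear in the question); tbl stands in for the real table, as in the rest of this answer.

```sql
-- Sketch under assumptions: event_flag / event_ref are made-up names.
ALTER TABLE tbl
  ADD COLUMN event_flag boolean,  -- would hold the former 0 / 1 values
  ADD COLUMN event_ref  integer;  -- NULL for the rows that carry no identifier

-- A plain B-tree index would also contain the NULL entries,
-- so restrict the index to rows that actually have an identifier.
CREATE INDEX tbl_event_ref_idx ON tbl (event_ref)
WHERE event_ref IS NOT NULL;
```

With such a layout, a lookup like WHERE event_ref = 123 implies event_ref IS NOT NULL, so the predicate would not need to be repeated in the query.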

Either way, you should definitely use a partial index. (That's the proper term as used in the manual.) Excluding 99% of the rows from the index makes it massively smaller, which matters for performance with millions of rows.

However, if you have a complete index on event_value anyway, and your common queries are retrieving single rows like:

SELECT * FROM tbl WHERE event_value = 123;

... then an additional partial index won't buy much. It would still be used as it's still a bit faster, but not much faster than a complete index. And the costs for an additional index may outweigh the benefits.

While the rare values are "32-bit identifiers", it may be incorrect to assume they are all > 1. The Postgres integer type is signed, and 32-bit values would also cover negative numbers. (Can we even rule out 0 or 1 as one of those identifiers?) If there can be negative values, too:

CREATE INDEX tbl_event_value_part_idx ON tbl (event_value)
WHERE event_value > 1 OR event_value < 0; -- or similar

event_value does not have to be an index column, regardless of its use in the WHERE clause. That entirely depends on the kinds of queries to expect. Either way, the safe bet is to add the same WHERE conditions literally to any query supposed to use the index, even if that's logically redundant. Postgres can make very basic logical conclusions to determine applicable indexes, but it is no AI and does not try to be (would get too expensive quickly). Like:

SELECT * FROM tbl WHERE event_value > 1 OR event_value < 0;
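As a hedged sketch of both points, the statements below first build a partial index on a different column and then repeat the index predicate literally in the queries. The column event_time and the index name tbl_event_time_part_idx are hypothetical; the value 123 is reused from the earlier example. EXPLAIN is one way to check whether the planner actually picks a partial index; the resulting plan depends on data and statistics, so no output is shown.

```sql
-- Hypothetical: event_time is an assumed column, not part of the question.
-- The indexed column differs from the column used in the index predicate.
CREATE INDEX tbl_event_time_part_idx ON tbl (event_time)
WHERE event_value > 1 OR event_value < 0;

-- Range query on the assumed column, repeating the predicate literally.
EXPLAIN
SELECT *
FROM   tbl
WHERE  event_time >= now() - interval '1 day'
AND   (event_value > 1 OR event_value < 0);

-- Single-identifier lookup against the index on event_value,
-- again with the (logically redundant) predicate spelled out.
EXPLAIN
SELECT *
FROM   tbl
WHERE  event_value = 123
AND   (event_value > 1 OR event_value < 0);
```

The parentheses around the OR condition matter: without them, AND binds more tightly than OR and the query means something else.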

MongoDB – How do databases store index key values (on-disk) for variable length fields

You can store your index as a list of fixed-size offsets into the block containing your key data. For example:

+--------------+
| 3            | number of entries
+--------------+
| 16           | offset of first key data
+--------------+
| 24           | offset of second key data
+--------------+
| 39           | offset of third key data
+--------------+
| key one |
+----------------+
| key number two |
+-----------------------+
| this is the third key |
+-----------------------+

(well, the key data would be sorted in a real example, but you get the idea).

Note that this does not necessarily reflect how index blocks are actually constructed in any database. This is merely an example of how you might organise a block of index data where the key data is of variable length.

Numbering Rows Consecutively for Multiple Tables in PostgreSQL

If I understand correctly, you want to restart the numbering with 0 for every table.
Use the window function row_number() in an UPDATE:

UPDATE tbl t
SET    xid = n.xid
FROM  (SELECT ctid, row_number() OVER (ORDER BY aid, bid, cid) - 1 AS xid FROM tbl) n
WHERE  t.ctid = n.ctid;

This uses ctid as a poor man's surrogate for a primary key, since you did not disclose your table definition.
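If the table does have a primary key, joining on it is the more robust choice, since ctid identifies a physical row version and changes when a row is updated. A minimal sketch, assuming a hypothetical primary key column named id:

```sql
-- Assumes a hypothetical primary key column "id" on tbl.
UPDATE tbl t
SET    xid = n.xid
FROM  (
   SELECT id, row_number() OVER (ORDER BY aid, bid, cid) - 1 AS xid
   FROM   tbl
   ) n
WHERE  t.id = n.id
AND    t.xid IS DISTINCT FROM n.xid;  -- skip rows that already hold the right number
```

The same statement would be run once per table, so the numbering restarts at 0 in each of them.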

