Postgresql – Select pseudorandom row from a very small partial index in Postgres

I have a huge table, with a tiny partial index on the <0.1% of rows that satisfy a certain condition (one column is NULL). I would like to pseudorandomly select a row that satisfies this condition. Can I use the partial index that already exists to speed this up, and if so, how?
Related Solutions
BRIN index
Available since Postgres 9.5 and probably just what you are looking for. Much faster index creation, much smaller index. But queries are typically not as fast. The manual:
BRIN stands for Block Range Index. BRIN is designed for handling very large tables in which certain columns have some natural correlation with their physical location within the table. A block range is a group of pages that are physically adjacent in the table; for each block range, some summary info is stored by the index.
Read on, there is more.
Depesz ran a preliminary test.
The optimum for your case: if you can write rows clustered on run_id, your index becomes very small and creation much cheaper.
CREATE INDEX foo ON run.perception USING brin (run_id, frame)
WHERE run_id >= 266 AND run_id <= 270;
You might even just index the whole table.
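That would just mean dropping the WHERE clause (reusing the index name from the statement above):

CREATE INDEX foo ON run.perception USING brin (run_id, frame);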
Table layout
Whatever else you do, you can save 8 bytes per row otherwise lost to padding due to alignment requirements by ordering columns like this:
CREATE TABLE run.perception(
id bigint NOT NULL PRIMARY KEY
, run_id bigint NOT NULL
, frame bigint NOT NULL
, by_anyone bigint NOT NULL
, by_me bigint NOT NULL
, owning_p_id bigint NOT NULL
, subj_id bigint NOT NULL
, subj_state_frame bigint NOT NULL
, obj_type_set bigint
, by_s_id integer
, seq integer
, by varchar(45) NOT NULL -- or just use type text
);
That makes your table 79 GB smaller if none of the columns has NULL values.
Also, you only have three columns that can be NULL. The NULL bitmap occupies 8 bytes for 9 - 72 columns. If only one integer column is NULL, there is a corner case for a storage paradox: it would be cheaper to use a dummy value instead: 4 bytes wasted but 8 bytes saved by not needing a NULL bitmap for the row.
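If you want to verify the effect on your own data: pg_column_size() on a row expression reports the size of the composite value, which closely tracks the on-disk tuple size. A quick sample over a slice of the huge table:

-- Average row size over a sample of 10000 rows:
SELECT avg(pg_column_size(p.*)) AS avg_row_bytes
FROM  (SELECT * FROM run.perception LIMIT 10000) p;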
Partial indexes
Depending on your actual queries it might be more efficient to have these five partial indices instead of the one above:
CREATE INDEX perception_run_id266_idx ON run.perception(frame) WHERE run_id = 266;
CREATE INDEX perception_run_id267_idx ON run.perception(frame) WHERE run_id = 267;
CREATE INDEX perception_run_id268_idx ON run.perception(frame) WHERE run_id = 268;
CREATE INDEX perception_run_id269_idx ON run.perception(frame) WHERE run_id = 269;
CREATE INDEX perception_run_id270_idx ON run.perception(frame) WHERE run_id = 270;
Run one transaction for each.
Removing run_id as index column this way saves 8 bytes per index entry - 32 instead of 40 bytes per row. Each index is also cheaper to create, but creating five instead of just one takes substantially longer for a table that's too big to stay in cache (as @Jürgen and @Chris commented). So that may or may not be useful for you.
Partitioning
Based on inheritance - the only option up to Postgres 9.5.
(The new declarative partitioning in Postgres 11 or, preferably, 12 is smarter.)
All constraints on all children of the parent table are examined during constraint exclusion, so large numbers of partitions are likely to increase query planning time considerably. So the legacy inheritance based partitioning will work well with up to perhaps a hundred partitions; don't try to use many thousands of partitions.
Bold emphasis mine. Consequently, estimating 1000 different values for run_id, you would make partitions spanning around 10 values each.
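A minimal sketch of one such child table under inheritance-based partitioning (table name and range hypothetical; the CHECK constraint is what constraint exclusion works from):

CREATE TABLE run.perception_260_269 (
    CHECK (run_id BETWEEN 260 AND 269)  -- ~10 run_id values per partition
) INHERITS (run.perception);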
maintenance_work_mem
I missed on my first read that you are already adjusting maintenance_work_mem. I'll leave the quote and advice in my answer for reference. Per documentation:
maintenance_work_mem (integer)

Specifies the maximum amount of memory to be used by maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. It defaults to 64 megabytes (64MB). Since only one of these operations can be executed at a time by a database session, and an installation normally doesn't have many of them running concurrently, it's safe to set this value significantly larger than work_mem. Larger settings might improve performance for vacuuming and for restoring database dumps.

Note that when autovacuum runs, up to autovacuum_max_workers times this memory may be allocated, so be careful not to set the default value too high. It may be useful to control for this by separately setting autovacuum_work_mem.
I would only set it as high as needed - which depends on the unknown (to us) index size. And only locally, for the executing session. As the quote explains, a too-high general setting can otherwise starve the server, because autovacuum may claim more RAM, too. Also, don't set it much higher than needed even in the executing session: free RAM might be put to good use caching data.
It could look like this:
BEGIN;
SET LOCAL maintenance_work_mem = '10GB'; -- depends on resulting index size
CREATE INDEX perception_run_frame_idx_run_266_thru_270 ON run.perception(run_id, frame)
WHERE run_id >= 266 AND run_id <= 270;
COMMIT;
About SET LOCAL:
The effects of SET LOCAL last only till the end of the current transaction, whether committed or not.
To measure object sizes, use pg_relation_size() (or pg_total_relation_size()) together with pg_size_pretty().
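For example, for the index created above:

SELECT pg_size_pretty(pg_relation_size('run.perception_run_frame_idx_run_266_thru_270'));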
The server should generally be configured reasonably otherwise, obviously.
There's no way to avoid creating a unique index if you want to efficiently ensure uniqueness for the data. The unique index is necessary for PostgreSQL to enforce uniqueness in the face of concurrent inserts, updates, and deletes.
A unique index is what backs a unique constraint; when you create a unique constraint, a unique index is created for you automatically.
The only other way to enforce uniqueness is to LOCK TABLE mytable IN EXCLUSIVE MODE before any change to the table that might affect the column you wish to make unique. That way you don't have to create a unique index, but in exchange you can only have one transaction changing the table at a time. You could do this from a trigger only when it detects a change in the unique column, but then you'd suffer from lock upgrades that would lead to frequent deadlocks.
Really, a unique index is the way to go here.
You can mitigate the impact of the required lock by using CREATE UNIQUE INDEX ... CONCURRENTLY, then using the version of ALTER TABLE ... ADD CONSTRAINT ... that specifies an already-created index to use for the UNIQUE constraint.
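A sketch of that two-step approach, with hypothetical table, column, and constraint names:

CREATE UNIQUE INDEX CONCURRENTLY mytable_mycol_uniq ON mytable (mycol);
ALTER TABLE mytable
    ADD CONSTRAINT mytable_mycol_key UNIQUE USING INDEX mytable_mycol_uniq;

Note that the index is renamed to match the constraint name when it is attached.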
Best Answer
Shouldn't be a problem: just put the WHERE condition of your partial index into the WHERE condition of your query.
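A minimal sketch, assuming (hypothetically, per the question's description) a table big_table and a partial index with the predicate some_col IS NULL:

-- The planner can satisfy the WHERE clause from the tiny partial index:
SELECT *
FROM   big_table
WHERE  some_col IS NULL  -- same predicate as the partial index
ORDER  BY random()
LIMIT  1;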
This does have to read all the indexed rows, so there is a limit to how fast it can be. It also has to "sort" all the rows, but since the sort knows about the LIMIT, it is not a full sort and so is not really N log N.
If you are willing to augment your table with a column of random numbers (which I had already done in my example, just for filler at the time), then you can do better.
You would have to tune the constant 0.0005 with knowledge of how many indexed rows there are. If you make it too small, there is the possibility that the range will contain no rows and so you get no result (you could then retry), and if you make it too large, you use more time than necessary.
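A sketch under the same hypothetical names, assuming a rand_val column filled with random() and a matching partial index:

-- One-time setup (hypothetical):
-- ALTER TABLE big_table ADD COLUMN rand_val double precision DEFAULT random();
-- CREATE INDEX ON big_table (rand_val) WHERE some_col IS NULL;

-- One attempt; here 0.354 stands for a random() value picked by the client.
-- Retry with a new starting point if no row comes back:
SELECT *
FROM   big_table
WHERE  some_col IS NULL
AND    rand_val BETWEEN 0.354 AND 0.354 + 0.0005  -- tune the width, see above
LIMIT  1;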
Without changing the table, I don't think you can do much better than reading all of the indexed tuples, unless you are willing to compromise on the quality of the randomness.