The fastest way I know of is a lookup in the system catalog view pg_roles
:
SELECT * FROM pg_roles WHERE rolname = 'postgres';
Strictly speaking, it would be even slightly faster to use the underlying table pg_authid
, but access to it is restricted to superusers for good reasons.
There is no object identifier type like for tables or types, which would allow a simple cast like 'mytable'::regclass
.
Be aware though, that OIDs are not stable across a dump / restore cycle. So, the OID
is no good for the use case!
In some databases I have a separate login
table with a serial primary key that I use for similar purposes. Maintained manually. And functions using it are prepared to occasionally not find a user in this table. A very basic and fast table:
CREATE TABLE users.login (
login_id serial PRIMARY KEY
,username text NOT NULL UNIQUE
);
When creating new users I use a plpgsql function that creates the new user in the system and enters it into my table at the same time. And I use this login_id
in many places. For instance I keep track of who made the last change to a row in most tables. I use this simple function:
CREATE OR REPLACE FUNCTION public.f_login_id()
RETURNS int AS
$func$
SELECT COALESCE((SELECT login_id FROM users.login
WHERE username = session_user::text), 0::int)
$func$ LANGUAGE sql STABLE;
No foreign key constraints so to keep things fast and simple. Obviously, I don't need strict referential integrity ...
BRIN index
Available since Postgres 9.5 and probably just what you are looking for. Much faster index creation, much smaller index. But queries are typically not as fast. The manual:
BRIN stands for Block Range Index. BRIN is designed for handling very
large tables in which certain columns have some natural correlation
with their physical location within the table. A block range is a
group of pages that are physically adjacent in the table; for each
block range, some summary info is stored by the index.
Read on, there is more.
Depesz ran a preliminiary test.
The optimum for your case: If you can write rows clustered on run_id
, your index becomes very small and creation much cheaper.
CREATE INDEX foo ON run.perception USING brin (run_id, frame)
WHERE run_id >= 266 AND run_id <= 270;
You might even just index the whole table.
Table layout
Whatever else you do, you can save 8 bytes lost to padding due to alignment requirements per row by ording columns like this:
CREATE TABLE run.perception(
id bigint NOT NULL PRIMARY KEY
, run_id bigint NOT NULL
, frame bigint NOT NULL
, by_anyone bigint NOT NULL
, by_me bigint NOT NULL
, owning_p_id bigint NOT NULL
, subj_id bigint NOT NULL
, subj_state_frame bigint NOT NULL
, obj_type_set bigint
, by_s_id integer
, seq integer
, by varchar(45) NOT NULL -- or just use type text
);
Makes your table 79 GB smaller if none of the columns has NULL values. Details:
Also, you only have three columns that can be NULL. The NULL bitmap occupies 8 bytes for 9 - 72 columns. If only one integer column is NULL, there is a corner case for a storage paradox: it would be cheaper to use a dummy value instead: 4 bytes wasted but 8 bytes saved by not needing a NULL bitmap for the row. More details here:
Partial indexes
Depending on your actual queries it might be more efficient to have these five partial indices instead of the one above:
CREATE INDEX perception_run_id266_idx ON run.perception(frame) WHERE run_id = 266;
CREATE INDEX perception_run_id266_idx ON run.perception(frame) WHERE run_id = 267;
CREATE INDEX perception_run_id266_idx ON run.perception(frame) WHERE run_id = 268;
CREATE INDEX perception_run_id266_idx ON run.perception(frame) WHERE run_id = 269;
CREATE INDEX perception_run_id266_idx ON run.perception(frame) WHERE run_id = 270;
Run one transaction for each.
Removing run_id
as index column this way saves 8 bytes per index entry - 32 instead of 40 bytes per row. Each index is also cheaper to create, but creating five instead of just one takes substantially longer for a table that's too big to stay in cache (like @Jürgen and @Chris commented). So that may or may not be useful for you.
Partitioning
Based on inheritance - the only option up to Postgres 9.5.
(The new declarative partitioning in Postgres 11 or, preferably, 12 is smarter.)
The manual:
All constraints on all children of the parent table are examined
during constraint exclusion, so large numbers of partitions are likely
to increase query planning time considerably. So the legacy inheritance
based partitioning will work well with up to perhaps a hundred partitions;
don't try to use many thousands of partitions.
Bold emphasis mine. Consequently, estimating 1000 different values for run_id
, you would make partitions spanning around 10 values each.
maintenance_work_mem
I missed that you are already adjusting for maintenance_work_mem
in my first read. I'll leave quote and advice in my answer for reference. Per documentation:
maintenance_work_mem
(integer)
Specifies the maximum amount of memory to be used by maintenance
operations, such as VACUUM
, CREATE INDEX
, and ALTER TABLE ADD FOREIGN KEY
. It defaults to 64 megabytes (64MB
). Since only one of these
operations can be executed at a time by a database session, and an
installation normally doesn't have many of them running concurrently,
it's safe to set this value significantly larger than work_mem
. Larger
settings might improve performance for vacuuming and for restoring
database dumps.
Note that when autovacuum
runs, up to autovacuum_max_workers
times this memory may be allocated, so be careful not to set the default
value too high. It may be useful to control for this by separately
setting autovacuum_work_mem
.
I would only set it as high as needed - which depends on the unknown (to us) index size. And only locally for the executing session. As the quote explains, a too-high general setting can starve the server otherwise, because autovacuum may claim more RAM, too. Also, don't set it much higher than needed, even in the executing session, free RAM might be put to good use in caching data.
It could look like this:
BEGIN;
SET LOCAL maintenance_work_mem = 10GB; -- depends on resulting index size
CREATE INDEX perception_run_frame_idx_run_266_thru_270 ON run.perception(run_id, frame)
WHERE run_id >= 266 AND run_id <= 270;
COMMIT;
About SET LOCAL
:
The effects of SET LOCAL
last only till the end of the current
transaction, whether committed or not.
To measure object sizes:
The server should generally be configured reasonably otherwise, obviously.
Best Answer
Your current idea of using a range query on the primary key is probably the best bet. I don't see how partial indexes could help with that. You would need a series of partial indexes which would in aggregate be a total index, and with BTREE indexes used for range queries that would be pointless. But, the success of this method is likely to depend on the rows of your table being physically approximately ordered by the primary key value. You can enforce this ordering by using the CLUSTER the command, but that is itself a very expensive operation to do.
If you want to get PostgreSQL to parallelize this for you, you would have use a dummy query which selects everything:
and you will probably also need to dramatically lower the "parallel_tuple_cost setting", perhaps all the way to zero. But this is not likely to give you an actual improvement, because the bottleneck (if it is not disk IO) will be in converting your data from its internal binary format to the text format that is going to be output by the COPY. This conversion is always done in the leader process, not the parallel workers.
You can kind of get around this, by writing your queries so the heavy lifting is explicitly done in the query itself:
Now the lead process still needs to scan all that text for things which need to be escaped/quoted, but at least it doesn't need to convert JSONB to text for all the rows by itself. In my hands, using 7 parallel workers (8 processes including the leader) cuts the time required to
COPY...TO
in half. The overhead then becomes the communication from the parallel workers to the lead process. The way to remove that is to go back your original suggestions of indexed range queries in separate sessions.