Suppose I have a table named `agency` with some columns:

- `internal_id` (integer, unique)
- `external_id` (bigint, unique)
- `name`, `location`, `created_at`, ...

`internal_id` and `external_id` are each unique and are candidates for the primary key column. There are some other tables (say A, B, C, D, E) that reference this table. Suppose each of these tables may contain millions or billions of rows. Normally I have the `external_id` at hand when I need to filter the data in A, B, C, D, E.

Which of the following scenarios is the best way to go, considering performance and storage space?
- Use `internal_id` as the primary key in `agency`, and as the foreign key in the other tables. Because this field takes 4 bytes of storage space, we can save billions of bytes. However, as I normally have the `external_id`, I have to pay the penalty of an extra `JOIN` in every query:

  ```sql
  SELECT A.*
  FROM   A
  INNER JOIN agency ON A.internal_id = agency.internal_id
  WHERE  agency.external_id = 5;
  ```
- Use `internal_id` as the primary key in `agency`, and as the foreign key in the other tables. But to get rid of the extra `JOIN`, my application could first map `external_id` to `internal_id` with a simple query (`SELECT internal_id FROM agency WHERE external_id = 5`), and then use the fetched `internal_id` in another simple query:

  ```sql
  SELECT *
  FROM   A
  WHERE  internal_id = 59;  -- 59 is the internal_id fetched by the first query
  ```

  Does this perform better than the `JOIN`, considering the extra round trip between the application and the database?
- Forget `internal_id` and use `external_id` as the primary key and foreign key, with the penalty of 4 extra bytes per row in each of the other tables (A, B, C, D, E), the cost of billions of bytes of additional storage space, and potentially even slower database operations (because of bigger database files):

  ```sql
  SELECT *
  FROM   A
  WHERE  external_id = 5;
  ```
Update:

- The `agency` table may contain tens of thousands, or at most a few million, rows.
- `internal_id` and `external_id` will not change over time, but other non-identity columns may change, if rarely.
- There are about 5 to 7 related tables (A, B, C, D, E, ...), a few of which may grow very large over time, say a few million rows per day (billions of rows over a year).
Best Answer
Assumptions:

- `agency` has fewer rows than the "millions and billions" you mention for the other tables, and stays way below the range of `integer` (-2147483648 to +2147483647). Else we would need `bigint` for `internal_id` to begin with.
- But `agency` is still big. Else, don't bother with the index optimizations below.
- Both `internal_id` and `external_id` hardly ever change.
- ID values are roughly evenly distributed: not a few extremely common agencies and many very rare ones. (That might favor query optimization without key translation.)
I would consider a combination of scenarios 1 and 2.
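A sketch of the query style I mean, using table A from the question:

```sql
SELECT *
FROM   A
WHERE  internal_id = (SELECT internal_id FROM agency WHERE external_id = 5);
```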
The subquery encapsulates the key translation and can be used as a drop-in replacement for a literal `internal_id`. It also makes the query planner's job a bit simpler when many joins are involved.

Unless you reuse the `internal_id` for many subsequent queries, a separate lookup needlessly adds the cost of an extra round trip to the server.

You might encapsulate the key translation in a simple SQL function.
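One possible shape for it (the function name `f_internal_id` and the parameter name are my own; it is declared `STABLE` since it neither modifies data nor depends on volatile state):

```sql
CREATE OR REPLACE FUNCTION f_internal_id(_external_id bigint)
  RETURNS int
  LANGUAGE sql STABLE PARALLEL SAFE AS
'SELECT internal_id FROM agency WHERE external_id = _external_id';
```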
Then the above query becomes:
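```sql
-- same example table A as above
SELECT *
FROM   A
WHERE  internal_id = f_internal_id(5);
```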
The function can be "inlined" by the query planner.
I suggest this table definition:
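Something like this, assuming Postgres 11 or later for the `INCLUDE` clauses (the types of the non-key columns are my guesses from the question):

```sql
CREATE TABLE agency (
  internal_id  int GENERATED ALWAYS AS IDENTITY
, external_id  bigint NOT NULL
, name         text
, location     text
, created_at   timestamptz
-- , more columns
, PRIMARY KEY (internal_id) INCLUDE (external_id)  -- covering: internal_id -> external_id
, UNIQUE      (external_id) INCLUDE (internal_id)  -- covering: external_id -> internal_id
);
```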
This provides the crucial indices on `(internal_id, external_id)` and `(external_id, internal_id)` and enforces the constraints you mentioned, without redundant indices. We need both because each one serves one direction of the key translation.

The first one (`PRIMARY KEY (internal_id) INCLUDE (external_id)`) is intended for reverse lookups. It seems likely you need those, too; else you can skip the `INCLUDE` clause there.

The definition makes heavy use of covering indices (Postgres 11 or later).
Among other things, covering indices negate the ballast of additional columns in `agency` for the purpose of key translation: with these indices in place, key translation burns down to very fast index-only scans. Their cost will be practically negligible in the context of queries on your huge tables.
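You can confirm this with `EXPLAIN` on the translation subquery; once the table has been vacuumed, the plan should be an index-only scan (index names are auto-generated, so yours may differ):

```sql
EXPLAIN
SELECT internal_id FROM agency WHERE external_id = 5;

-- Expected plan shape:
-- Index Only Scan using agency_external_id_internal_id_key on agency ...
```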
This saves "millions and billions" times 4 bytes for every additional table, and for every index on that column (which may matter a lot more). True, storage is getting cheaper all the time, but RAM (and fast cache memory!) is still typically limited. Bigger tables and indices mean that less of the data can stay in cache, and that's crucial for performance.

Wider rows always affect the overall performance of the database more or less negatively, even with cheap storage.
And it's typically much easier on the human eye to operate with smaller `integer` numbers in the many tables (and log files, and debugging, ...). That may even be the most important practical benefit.