Context
I'm designing a database (on PostgreSQL 9.6) which will store data from a distributed application. Due to the application's distributed nature, I cannot use auto-incrementing integers (`SERIAL`) as my primary key because of potential race conditions.
The natural solution is to use a UUID, or globally unique identifier. Postgres comes with a built-in `uuid` type, which is a perfect fit.
The problem I have with UUIDs is related to debugging: they are not human-friendly strings. The identifier `ff53e96d-5fd7-4450-bc99-111b91875ec5` tells me nothing, whereas `ACC-f8kJd9xKCd`, while not guaranteed to be unique, tells me I'm dealing with an `ACC` object.
From a programming perspective, it is common to debug application queries relating several different objects. Suppose the programmer wrongly searches for an `ACC` (account) object in the `ORD` (order) table. With a human-readable identifier, the programmer spots the problem instantly, while with UUIDs they would spend some time figuring out what was wrong.
I do not need the guaranteed uniqueness of UUIDs; I do need some room for generating keys without conflicts, but UUID is overkill. Also, in the worst-case scenario, it wouldn't be the end of the world if a collision happened (the database rejects it and the application can recover). So, trade-offs considered, a smaller but human-friendly identifier would be the ideal solution for my use case.
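To put "some room for generating keys without conflicts" into numbers, the birthday approximation estimates the collision probability for n keys drawn uniformly from N possibilities. A quick sketch (the 10-character base-62 random part is my assumption, modeled on the shape of `ACC-f8kJd9xKCd`):

```python
# Birthday-bound sketch of the collision risk for random keys.
# Assumption: a 10-character random part over a 62-symbol alphabet,
# matching the shape of ACC-f8kJd9xKCd (not something the question fixes).
N = 62 ** 10           # ~8.4e17 possible random strings per domain
n = 1_000_000          # keys generated in one domain (the "average" table size)

# For n much smaller than sqrt(N): P(collision) ~= n*(n-1) / (2*N)
p_collision = n * (n - 1) / (2 * N)
print(f"{p_collision:.1e}")   # on the order of 1e-6
```

So at a million rows per domain the odds of even one collision are roughly one in a million, which matches the "database rejects it and the application recovers" fallback above.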
Identifying application objects
The identifier I came up with has the following format: `{domain}-{string}`, where `{domain}` is replaced with the object domain (account, order, product) and `{string}` is a randomly generated string. In some cases, it might even make sense to insert a `{sub-domain}` before the random string. For the purposes of uniqueness, let's ignore the lengths of `{domain}` and `{string}`.
The format can have a fixed size if it helps the indexing/querying performance.
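As a concrete sketch of such a generator (the function name, the base-62 alphabet, and the 10-character length are my assumptions, not part of the format above):

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # base-62, as in f8kJd9xKCd

def make_key(domain: str, length: int = 10) -> str:
    """Build a key like ACC-f8kJd9xKCd: a domain tag plus a random string."""
    rand = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return f"{domain}-{rand}"

print(make_key("ACC"))  # e.g. ACC-f8kJd9xKCd (random)
```

Using `secrets` rather than `random` keeps the keys unpredictable, which also helps avoid accidental patterns across distributed nodes.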
The problem
Knowing that:
- I want primary keys with a format like `ACC-f8kJd9xKCd`.
- These primary keys will be part of several tables.
- All of these keys will be used in several joins/relationships, in a 6NF database.
- Most tables will have a medium to large-ish size (averaging ~1M rows; the largest ones ~100M rows).
Regarding performance, what is the best way to store this key?
Below are four possible solutions, but since I have little experience with databases I'm unsure which (if any) is the best.
Considered solutions
1. Store as string (`VARCHAR`)
(Postgres makes no difference between `CHAR(n)` and `VARCHAR(n)`, so I'm ignoring `CHAR`.)
After some research, I've found that string comparison with `VARCHAR`, especially in join operations, is slower than with `INTEGER`. This makes sense, but is it something I should worry about at this scale?
2. Store as binary (`bytea`)
Unlike Postgres, MySQL does not have a native `UUID` type. There are several posts explaining how to store a UUID in a 16-byte `BINARY` field instead of a 36-byte `VARCHAR` one. These posts gave me the idea of storing the key as binary (`bytea` on Postgres).
This saves space, but I'm more concerned with performance. I had little luck finding an explanation of which comparison is faster: binary or string. I believe binary comparisons are faster; if they are, then `bytea` is probably better than `VARCHAR`, even though the programmer now has to encode/decode the data every time.
I might be wrong, but I think both `bytea` and `VARCHAR` will compare (for equality) byte by byte (or character by character). Is there a way to skip this step-by-step comparison and simply compare "the whole thing"? (I don't think so, but it costs nothing to ask.)
I think storing as `bytea` is the best of these solutions, but I wonder if there are other alternatives I'm ignoring. Also, the same concern I expressed for solution 1 holds: is the overhead on comparisons enough that I should worry about it?
"Creative" solutions
I came up with two very "creative" solutions that could work; I'm just unsure to what extent (i.e., whether I would have trouble scaling them beyond a couple of thousand rows in a table).
3. Store as `UUID` but with a "label" attached to it
The main reason not to use UUIDs is so that programmers can better debug the application. But what if we could use both: the database stores all keys as plain `UUID`s, but wraps/unwraps the label before and after queries are made.
For example, the programmer asks for `ACC-{UUID}`, the database ignores the `ACC-` part, fetches the results, and returns all of them as `{domain}-{UUID}`.
Maybe this would be possible with some hackery with stored procedures or functions, but some questions come to mind:
- Is this (removing/adding the domain on each query) a substantial overhead?
- Is this even possible?
I've never used stored procedures or functions before, so I'm not sure whether this is even possible. Can someone shed some light? If I can add a transparent layer between the programmer and the stored data, it seems like a perfect solution.
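Whether or not the database can do this transparently, the wrapping layer itself is thin. A sketch of the idea in application code (function names are hypothetical):

```python
import uuid

def unwrap(key: str) -> uuid.UUID:
    """Strip the domain label before querying: 'ACC-<uuid>' -> UUID."""
    _, _, raw = key.partition("-")   # split at the first hyphen only
    return uuid.UUID(raw)

def wrap(domain: str, value: uuid.UUID) -> str:
    """Re-attach the label to rows coming back from the database."""
    return f"{domain}-{value}"

key = wrap("ACC", uuid.UUID("ff53e96d-5fd7-4450-bc99-111b91875ec5"))
print(key)          # ACC-ff53e96d-5fd7-4450-bc99-111b91875ec5
print(unwrap(key))  # ff53e96d-5fd7-4450-bc99-111b91875ec5
```

The database then only ever sees native `uuid` values, keeping index and join performance unchanged; the overhead is two string operations per key in the application.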
4. (My favorite) Store as IPv6 (`cidr`)
Yes, you read that right. It turns out that the IPv6 address format solves my problem perfectly.
- I can put domains and sub-domains in the first few octets, and use the remaining ones as the random string.
- The collision odds are OK. (I wouldn't be using all 2^128 values, but they are still OK.)
- Equality comparisons are (hopefully) optimized, so I might get better performance than by simply using `bytea`.
- I can actually perform some interesting comparisons, like `contains`, depending on how the domains and their hierarchy are represented.
For example, suppose I use the code `0000` to represent the domain "products". The key `0000:0db8:85a3:0000:0000:8a2e:0370:7334` would then represent the product `0db8:85a3:0000:0000:8a2e:0370:7334`.
The main question here is: compared to `bytea`, are there any major advantages or disadvantages to using the `cidr` data type?
Best Answer
Using `ltree`
If IPv6 works, great, but it doesn't support labels like `ACC`; `ltree` does. You'd use it like this: declare the key column as `ltree`, insert some sample data, query it with the ltree operators, and voilà.
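A sketch of what that setup might look like (the table and key values are my own illustration):

```sql
CREATE EXTENSION IF NOT EXISTS ltree;

CREATE TABLE product (
    id   ltree PRIMARY KEY,  -- e.g. 'ACC.f8kJd9xKCd'; labels use [A-Za-z0-9_], '.' separates them
    name text NOT NULL
);

INSERT INTO product (id, name) VALUES
    ('ACC.f8kJd9xKCd', 'sample account row'),
    ('ORD.x9dLm2qPZa', 'sample order row');

-- Everything under the ACC domain:
SELECT * FROM product WHERE id <@ 'ACC';
```

Note that ltree labels can't contain hyphens, so the separator becomes a dot (`ACC.f8kJd9xKCd` rather than `ACC-f8kJd9xKCd`), and the label hierarchy gives you the "contains"-style queries for free.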
See the docs for more info and the available operators.
If you're creating the product ids yourself, I would use `ltree`. If you need something to generate them for you, I would use UUID.