Postgresql – Non-integer primary key considerations

database-designpostgresql

Context

I'm designing a database (on PostgreSQL 9.6) which will store data from a distributed application. Due to the application's distributed nature, I can not use auto-increment integers (SERIAL) as my primary key because of potential race-conditions.

The natural solution is to use an UUID, or a globally unique identifier. Postgres comes with a built-in UUID type, which is a perfect fit.

The problem I have with UUID is related to debugging: it's a non-human-friendly string. The identifier ff53e96d-5fd7-4450-bc99-111b91875ec5 tells me nothing, whereas ACC-f8kJd9xKCd, while not guaranteed to be unique, tells me I'm dealing with an ACC object.

From a programming perspective, it is common to debug application queries relating several different objects. Suppose the programmer wrongly searches for an ACC (account) object at the ORD (order) table. With a human-readable identifier, the programmer instantly identifies the problem, while using UUIDs he would spend some time figuring out what was wrong.

I do not need the "guaranteed" uniqueness of UUIDs; I do need some room for generating keys without conflicts, but UUID is overkill. Also, worst case scenario, it wouldn't be the end of the world if a collision happened (the database rejects it and the application can recover). So, trade-offs considered, a smaller but human-friendly identifier would be the ideal solution for my use case.

Identifying application objects

The identifier I came up with has the following format: {domain}-{string}, where {domain} is replaced with the object domain (account, order, product) and {string} is a randomly generated string. In some cases, it might even make sense to insert a {sub-domain} before the random string. Let's ignore the length of {domain} and {string} for the purpose of guaranteeing uniqueness.

The format can have a fixed size if it helps the indexing/querying performance.

The problem

Knowing that:

  • I want to have primary keys with a format like ACC-f8kJd9xKCd.
  • These primary keys will be part of several tables.
  • All these keys will be used on several joins/relationships, on a 6NF database.
  • Most tables will have a medium to large-ish size (averaging ~1M rows; largest ones with ~100M rows).

Regarding performance, what is the best way to store this key?

Below are four possible solutions, but since I have little experience with databases I'm unsure which (if any) is the best.

Considered solutions

1. Store as string (VARCHAR)

(Postgres makes no difference between CHAR(n) and VARCHAR(n), so I'm ignoring CHAR).

After some research, I've found out that string comparison with VARCHAR, specially on join operations, is slower than using INTEGER. This makes sense, but is it something that I should worry about at this scale?

2. Store as binary (bytea)

Unlike Postgres, MySQL does not have a native UUID type. There are several posts explaining how to store an UUID using a 16-byte BINARY field, instead of a 36-byte VARCHAR one. These posts gave me the idea of storing the key as binary (bytea on Postgres).

This saves size, but I'm more concerned with performance. I had little luck finding an explanation on which comparison is faster: binary or string ones. I believe binary comparisons are faster. If they are, then bytea is probably better than VARCHAR, even though the programmer now has to encode/decode the data every time.

I might be wrong, but I think both bytea and VARCHAR will compare (equality) byte by byte (or character by character). Is there a way to "skip" this step-by-step comparison and simply compare "the whole thing"? (I don't think so, but it doesn't cost checking).

I think storing as bytea is the best solution, but I wonder if there are any other alternatives that I'm ignoring. Also, the same concern I expressed on solution 1 holds true: is the overhead on comparisons enough that I should worry about?

"Creative" solutions

I came up with two very "creative" solutions that could work, I'm just unsure at which extent (i.e. if I would have trouble scaling them to more than a couple thousand rows in a table).

3. Store as UUID but with a "label" attached to it

The main reason to not use UUIDs is so that programmers can better debug the application. But what if we can use both: the database stores all keys as UUIDs only, but it wraps the object before/after queries are made.

For example, the programmer asks for ACC-{UUID}, the database ignores the ACC- part, fetches the results, and return all of them as {domain}-{UUID}.

Maybe this would be possible with some hackery with stored procedures or functions, but some questions come to mind:

  • Is this (removing/adding the domain at each query) a substantial overhead?
  • Is this even possible?

I've never used stored procedures or functions before, so I'm not sure whether this is even possible. Can someone shed some light? If I can add a transparent layer between the programmer and the stored data, it seems a perfect solution.

4. (My favorite) Store as IPv6 cidr

Yes, you read it right. It turns out that the IPv6 address format solves my problem perfectly.

  • I can add domains and sub-domains at the first few octets, and use the remaining ones as the random string.
  • The collision odds are OK. (I wouldn't be using 2^128 though, but it is still OK.)
  • Equality comparisons are (hopefully) optimized, so I might get better performance than simply using bytea.
  • I can actually perform some interesting comparisons, like contains, depending on how the domains and their hierarchy are represented.

For example, suppose I use code 0000 to represent the domain "products". Key 0000:0db8:85a3:0000:0000:8a2e:0370:7334 would represent the product 0db8:85a3:0000:0000:8a2e:0370:7334.

The main question here is: compared to bytea, is there any main advantage or disadvantage on using cidr data type?

Best Answer

Using ltree

If IPV6 works, great. It doesn't support "ACC". ltree does.

A label path is a sequence of zero or more labels separated by dots, for example L1.L2.L3, representing a path from the root of a hierarchical tree to a particular node. The length of a label path must be less than 65kB, but keeping it under 2kB is preferable. In practice this is not a major limitation; for example, the longest label path in the DMOZ catalog (http://www.dmoz.org) is about 240 bytes.

You'd use it like this,

CREATE EXTENSION ltree;
SELECT replace('ACC-f8kJd9xKCd', '-', '.')::ltree;

We create sample data.

SELECT x, (
  CASE WHEN x%7=0 THEN 'ACC'
    WHEN x%3=0 THEN 'XYZ'
    ELSE 'COM'
  END ||'.'|| md5(x::text)
  )::ltree
FROM generate_series(1,10000) AS t(x);

CREATE INDEX ON foo USING GIST (ltree);
ANALYZE foo;


  x  |                ltree                 
-----+--------------------------------------
   1 | COM.c4ca4238a0b923820dcc509a6f75849b
   2 | COM.c81e728d9d4c2f636f067f89cc14862c
   3 | XYZ.eccbc87e4b5ce2fe28308fd9f2a7baf3
   4 | COM.a87ff679a2f3e71d9181a67b7542122c
   5 | COM.e4da3b7fbbce2345d7772b0674a318d5
   6 | XYZ.1679091c5a880faf6fb5e6087eb1b2dc
   7 | ACC.8f14e45fceea167a5a36dedd4bea2543
   8 | COM.c9f0f895fb98ab9159f51fd0297e236d

And viola..

                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foo  (cost=103.23..234.91 rows=1414 width=57) (actual time=0.422..0.908 rows=1428 loops=1)
   Recheck Cond: ('ACC'::ltree @> ltree)
   Heap Blocks: exact=114
   ->  Bitmap Index Scan on foo_ltree_idx  (cost=0.00..102.88 rows=1414 width=0) (actual time=0.389..0.389 rows=1428 loops=1)
         Index Cond: ('ACC'::ltree @> ltree)
 Planning time: 0.133 ms
 Execution time: 1.033 ms
(7 rows)

See the docs for more info and operators

If you're creating the product ids, I would ltree. If you need something to create them, I would use UUID.