PostgreSQL Database Design – Alternatives to Lookup Tables for Deduplicating Text Columns

database-designpostgresql

My database has VARCHAR columns each containing 20k-60k distinct short texts. It is not feasbile (due to space constraints) to store these values in each row, they need to be externalized somehow. These values are never updated.

The classic approach would be to replace each of these columns with a separate lookup table and corresponding id – this however requires a lot of joins and a lot of similar tables.

Enums don't seem to be feasible for this problem, although they would remove the need for explicit joins: They cant be longer than 63 bytes and it is impossible to add new values in transactions until version 10.

I have looked in the documentation for CREATE TABLE – there seems to be no built-in support for this kind of deduplication. TOAST-Tables don't seem to contain unique values either.

Is there another approach I have missed?

Best Answer

Lots of joins is exactly the usual approach.

You can, however, write a non-STRICT LANGUAGE sql function like lookup_mytext(id) that does a SELECT from the lookup table by id. PostgreSQL will inline it as a subquery, then flatten the subquery to a join. So the effect is the same, but it's notationally simpler to write

SELECT
  x,
  lookup_mytext(y)
FROM t;

than

SELECT
  t.x, tt.val
FROM
  t INNER JOIN tt ON (t.y = tt.id);

Most of the time I prefer to avoid obfuscation and keep the explicit join. But it can become beneficial if this is a real pattern across the app.

See the documentation for rules on function inlining. Check with EXPLAIN.