PostgreSQL Database Design – Alternatives to Lookup Tables for Deduplicating Text Columns

database-design postgresql

My database has VARCHAR columns, each containing 20k-60k distinct short texts. Because of space constraints it is not feasible to store these values in each row, so they need to be externalized somehow. These values are never updated.

The classic approach would be to replace each of these columns with a separate lookup table and corresponding id – this however requires a lot of joins and a lot of similar tables.

Enums don't seem to be feasible for this problem, although they would remove the need for explicit joins: they can't be longer than 63 bytes, and it is impossible to add new values in transactions until version 10.

I have looked in the documentation for CREATE TABLE – there seems to be no built-in support for this kind of deduplication. TOAST tables don't seem to contain unique values either.

Is there another approach I have missed?

Best Answer

Lots of joins is exactly the usual approach.

You can, however, write a non-STRICT LANGUAGE sql function like lookup_mytext(id) that does a SELECT from the lookup table by id. Where the planner's inlining rules allow it, PostgreSQL will inline the function as a subquery, then flatten the subquery to a join. In that case the effect is the same, but it's notationally simpler to write:

SELECT
  x,
  lookup_mytext(y)
FROM t;

than

SELECT
  t.x, tt.val
FROM
  t INNER JOIN tt ON (t.y = tt.id);

Most of the time I prefer to avoid obfuscation and keep the explicit join. But it can become beneficial if this is a real pattern across the app.

See the documentation for rules on function inlining. Check with EXPLAIN.
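
As a rough sketch of what this answer describes, here is one way the lookup function could look in PostgreSQL. The tables t(x, y) and tt(id, val) follow the queries above; the column types, the constraints, the parameter name p_id and the STABLE marking are assumptions added for illustration.

```sql
-- Hypothetical schema matching the names used in the queries above.
CREATE TABLE tt (
  id  integer PRIMARY KEY,
  val text NOT NULL UNIQUE          -- each distinct text is stored once
);

CREATE TABLE t (
  x integer,
  y integer NOT NULL REFERENCES tt (id)
);

-- Non-STRICT, LANGUAGE sql, body is a single SELECT.
-- STABLE is an assumption: the function only reads data.
CREATE FUNCTION lookup_mytext(p_id integer) RETURNS text
LANGUAGE sql STABLE
AS $$
  SELECT val FROM tt WHERE id = p_id
$$;

-- Check what the planner actually does with the call.
EXPLAIN VERBOSE
SELECT x, lookup_mytext(y)
FROM t;
```

If the plan shows only a scan of t with the function call in its output list and no reference to tt, the call was not inlined and the function runs once per row. In that case the explicit join is likely the better choice for large result sets.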

Related Solutions

SQL Server – Bitmask Flags with Lookup Tables Clarification

I partially agree with Aaron's comment - in the most general case for storing 21 unrelated pieces of information, you'd probably use 21 bit columns. As a general solution, it may well be your best solution. If you had multiple bitmask-ed varchar columns, that would translate to a row with possibly over a hundred bit flags. FYI, 21 bits get stored as 3 bytes when you don't define them as NULLable, removing the necessity for space in the NULL bitmap. Since you have multiple bitmask columns, you'd end up with every 8 bits mashed into a byte.

What SQL Server ends up doing with your multi-column queries is eventually a bunch of bitmasking routines (yes! SQL Server uses bitmasks, so the concept per se can't be all bad!), but for average use cases it makes life easier for you.

If we had more information about what types of queries you run, we may be able to better advise, because ultimately the use cases dictate the design.

If you stay with the COMPUTED column, I would mark it PERSISTED and index it if you haven't already. It helps some queries, such as:

Exact matches:

WHERE computedInt = POWER(2, 6) -- bit position 7

AND matching on the 15th bit combined with OR matching on 2 other bits (10th and 7th):

WHERE computedInt >= Power(2,14) AND computedInt < Power(2,15) AND computedInt & (Power(2,9) + Power(2,6)) > 0

These are probably exotic examples, yet they do come up in real life. It's certainly not much worse than 21 individual bit columns. Yes, your statements would be easier to write with those, but remember that SQL Server has mashed them into 3 bytes for storage and will be doing the bit-unmasking anyway! If bit-masking were all bad (without exception), SQL Server wouldn't be doing it, right?
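
One caveat worth spelling out: the range predicate above (computedInt >= Power(2,14) AND computedInt < Power(2,15)) only matches rows where the 15th bit is the highest bit set. If higher bits can also be set, a purely bitwise test is one alternative. The following T-SQL is a minimal sketch; the table name dbo.Flags is hypothetical, and the bitwise form gives up the index seek that the range form allows.

```sql
-- T-SQL sketch; dbo.Flags is a hypothetical table with an int column computedInt.
SELECT *
FROM dbo.Flags
WHERE (computedInt & POWER(2, 14)) <> 0                   -- 15th bit is set
  AND (computedInt & (POWER(2, 9) + POWER(2, 6))) <> 0;   -- 10th or 7th bit is set
```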

EDIT

Re the scenario of

Four flags, HasHouse,HasCar,HasCat,HasDog, 0000 is has none, 1111 is has all.

it is more efficient and logically expedient to test all 4 bits at once and do a single integer based operation, e.g.

WHERE computedInt & (POWER(2,10)+POWER(2,5)+POWER(2,3)+POWER(2,1)) = 0 -- has none
WHERE computedInt & (POWER(2,10)+POWER(2,5)+POWER(2,3)+POWER(2,1)) > 0 -- has one or more

Hypothetically, if this were your most exercised query on the table, you might even group the four columns into another computed column and index it separately, making the bitmask unnecessary (just test the resultant int with =0 and >0). You might even go further and just precompute the answer... horses for courses.
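
The idea of grouping the four flags into their own computed column could be sketched in T-SQL roughly as follows. The table name dbo.Person, the column name Possessions and the bit weights are assumptions made for this illustration.

```sql
-- T-SQL sketch; all object names are hypothetical.
CREATE TABLE dbo.Person (
  PersonId int NOT NULL PRIMARY KEY,
  HasHouse bit NOT NULL,
  HasCar   bit NOT NULL,
  HasCat   bit NOT NULL,
  HasDog   bit NOT NULL,
  -- Pack the four flags into one integer: 0 = has none, 15 = has all.
  Possessions AS (
      CAST(HasHouse AS int) * 8
    + CAST(HasCar   AS int) * 4
    + CAST(HasCat   AS int) * 2
    + CAST(HasDog   AS int)
  ) PERSISTED
);

CREATE INDEX IX_Person_Possessions ON dbo.Person (Possessions);

SELECT PersonId FROM dbo.Person WHERE Possessions = 0;   -- has none
SELECT PersonId FROM dbo.Person WHERE Possessions > 0;   -- has one or more
SELECT PersonId FROM dbo.Person WHERE Possessions = 15;  -- has all four
```

Because the column holds only these four flags, plain integer comparisons replace the bitmask, and the index can serve them directly.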

PostgreSQL – How to store thousands of properties related to a record in PostgreSQL

You could create a table to hold the key/value pairs like so:

CREATE TABLE KEY_VALUE (
  ID BIGINT, -- THIS COULD BE A FKEY TO YOUR '300K' RECORD TABLE
  KEY VARCHAR,
  VALUE VARCHAR
);

As long as this table is indexed properly (an index on id/key, I imagine), getting any key/value pairs you are interested in should still be very quick, even if the table grows to millions of rows. Granted, whether this is a viable option really depends on what kind of data you expect to store in your key/value pairs (is it all text? numbers? etc.). Perhaps adding another column that says what type of data the value holds would help.
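
A minimal sketch of the indexed variant with a type column, in PostgreSQL, might look like this. The table name key_value_typed, the value_type column, its default and the sample keys are assumptions for illustration only.

```sql
-- Hypothetical variant of the table above with an index and a type hint.
CREATE TABLE key_value_typed (
  id         bigint  NOT NULL,                 -- could reference the parent record table
  key        varchar NOT NULL,
  value      varchar,
  value_type varchar NOT NULL DEFAULT 'text',  -- e.g. 'text', 'integer', 'date'
  PRIMARY KEY (id, key)                        -- also provides the index on (id, key)
);

-- Fetch selected properties of one record; the primary key index covers this lookup.
SELECT key, value, value_type
FROM key_value_typed
WHERE id = 42
  AND key IN ('color', 'weight');
```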

EDIT

https://sqlblog.org/2009/11/19/what-is-so-bad-about-eav-anyway

Just to be clear, I figured this might work given the sheer number of different key->value pairs you said you would have. If it were a small number, I would probably just have a "details" type of table where each value is stored in its own column.
