PostgreSQL Hashing – Using Text Hash to Address Rows

hashingpostgresql

I have a project for a website where I have to address thousands or maybe tens of thousands of objects in a postgres table.

Initially I had the url to be:

example.com/object/{ID}/{TEXT SLUG}/

where {ID} is just an integer that corresponds to a serial primary key in my table, {TEXT SLUG} does nothing practically here except for SEO friendliness reasons.

Now, I want to hide my {ID}, since someone may just request all objects incrementing the {ID} using some easy script and reconstruct my database simply, I thought to design the url to be more modern like

example.com/object/{TEXT SLUG}/

Now, if I just address the table using {TEXT SLUG} directly, wouldn't it be much slower than simply address using {ID} as before even with using some index especially when it gets bigger (I expect something in the range of 40K to 60K rows at maximum but I only have 2K currently in the table)?

I thought to use some lightweight hashing algorithm to hash the {TEXT SLUG} and address the table using the hash with adding an additional column having a copy of the hash. Since Postgres has 4 or 8 byte integers, I am tempted to use some lightweight 128-bit hash algorithm (e.g. Murmurhash), am I thinking right about using the idea of hashing the slug or is there something I am unaware of?

Best Answer

Now, if I just address the table using {TEXT SLUG} directly, wouldn't it be much slower than simply address using {ID} as before even with using some index especially when it gets bigger (I expect something in the range of 40K to 60K rows at maximum but I only have 2K currently in the table)?

60K rows is really small by today's standard, and the difference between looking up by an integer primary key and by an indexed text slug is likely not to be noticeable in practice.

That being said, there's still room for improvement in the change you're considering if you had to squeeze performance anyway.

Firstly, there's no real need to add a column for the hashed value because PostgreSQL can directly index function results:

CREATE INDEX index_name ON table_name(hashfunc(ID));

Secondly, there's not even the need for a hash stored in an additional index. Rather, a format-preserving encryption scheme could be used to export your ID in unguessable form through the slug. This technique has several advantages over the hash:

  • no storage: ID-to-slug and slug-to-ID translations are done by immutable function(s) of O(1) complexity.

  • hashes are theorically not unique, whereas encryption comes with a mathematical guarantee of unicity. See Which hashing algorithm is best for uniqueness and speed? for a nice study of common fast hash functions (when it reports no collision when hashing numbers, remember that they need to be salted to avoid an outsider guessing).

  • 128-bit output makes a larger slug than necessary for a 32-bit input. For instance when expressed in base64, 128-bit needs up to 24 characters versus 8 for 32-bit.

To transform the sequential-looking 32-bit IDs into an equivalent unguessable sequence with a secret key, I'd recommend skip32, which has a plpgsql implementation here: https://wiki.postgresql.org/wiki/Skip32

Then this output could be pushed as-is into the URL (be aware that it's signed, though). Or if you favor the text slug, transform it to a slug with an int-to-text reversible conversion of your choice.