PostgreSQL Hashing – Using Text Hash to Address Rows

hashingpostgresql

I have a project for a website where I have to address thousands or maybe tens of thousands of objects in a postgres table.

Initially I had the url to be:

example.com/object/{ID}/{TEXT SLUG}/

where {ID} is just an integer that corresponds to a serial primary key in my table, {TEXT SLUG} does nothing practically here except for SEO friendliness reasons.

Now, I want to hide my {ID}, since someone may just request all objects incrementing the {ID} using some easy script and reconstruct my database simply, I thought to design the url to be more modern like

example.com/object/{TEXT SLUG}/

Now, if I just address the table using {TEXT SLUG} directly, wouldn't it be much slower than simply address using {ID} as before even with using some index especially when it gets bigger (I expect something in the range of 40K to 60K rows at maximum but I only have 2K currently in the table)?

I thought to use some lightweight hashing algorithm to hash the {TEXT SLUG} and address the table using the hash with adding an additional column having a copy of the hash. Since Postgres has 4 or 8 byte integers, I am tempted to use some lightweight 128-bit hash algorithm (e.g. Murmurhash), am I thinking right about using the idea of hashing the slug or is there something I am unaware of?

Best Answer

Now, if I just address the table using {TEXT SLUG} directly, wouldn't it be much slower than simply address using {ID} as before even with using some index especially when it gets bigger (I expect something in the range of 40K to 60K rows at maximum but I only have 2K currently in the table)?

60K rows is really small by today's standard, and the difference between looking up by an integer primary key and by an indexed text slug is likely not to be noticeable in practice.

That being said, there's still room for improvement in the change you're considering if you had to squeeze performance anyway.

Firstly, there's no real need to add a column for the hashed value because PostgreSQL can directly index function results:

CREATE INDEX index_name ON table_name(hashfunc(ID));

Secondly, there's not even the need for a hash stored in an additional index. Rather, a format-preserving encryption scheme could be used to export your ID in unguessable form through the slug. This technique has several advantages over the hash:

no storage: ID-to-slug and slug-to-ID translations are done by immutable function(s) of O(1) complexity.
hashes are theorically not unique, whereas encryption comes with a mathematical guarantee of unicity. See Which hashing algorithm is best for uniqueness and speed? for a nice study of common fast hash functions (when it reports no collision when hashing numbers, remember that they need to be salted to avoid an outsider guessing).
128-bit output makes a larger slug than necessary for a 32-bit input. For instance when expressed in base64, 128-bit needs up to 24 characters versus 8 for 32-bit.

To transform the sequential-looking 32-bit IDs into an equivalent unguessable sequence with a secret key, I'd recommend skip32, which has a plpgsql implementation here: https://wiki.postgresql.org/wiki/Skip32

Then this output could be pushed as-is into the URL (be aware that it's signed, though). Or if you favor the text slug, transform it to a slug with an int-to-text reversible conversion of your choice.

Related Solutions

Postgresql – find all JSON scalars in PostgreSQL 9.3

Horrible as it is, you could write a PL/PgSQL function that attempts to extract an element from the object, traps the exception if it fails, and returns true on success or false on exception.

Untested:

create or replace function json_isobject(obj json)
returns boolean immutable language plpgsql as $$
begin
    begin
        perform obj->'';
    exception
      when invalid_parameter_value then
        return false;
      when others then
        raise;
    end;
    return true;
end;
$$;

This'll be pretty inefficient, since it's creating a sub-transaction for each call.

Or you could just update to 9.4.

On 9.4 I can't reproduce your reported error. Trying to look up a key in a json scalar returns null.

test=> select ('{"k":"v"}'::json) -> 'k' -> 'blah';
 ?column? 
----------

(1 row)

... and the same with using ->> to dereference a scalar.

So I'm not sure if this function will work on 9.3, and it won't on 9.4. To be 9.4-compatible you'd have to test if current_setting('pg_version_num') >= 90400 and return a result based on json_typeof instead.

You might need to adapt it to use the same expression you use to trigger your reported error.

PostgreSQL Performance – Full Text Search Over Multiple Related Tables

Optimization

You're going down the right track.

You either need to

Denormalize
Cache

Caching the results

What you probably want is a MATERIALIZED VIEW. This is easy and works reasonably well.

CREATE MATERIALIZED VIEW foo
AS
SELECT t.id, to_tsvector(concat_ws(' ',a.name, o.address, t.description, a.name)) AS tsv
FROM tasks AS t
INNER JOIN actors AS a ON t.actor_id = a.id
INNER JOIN objects AS o ON t.object_id = o.id 
;

Then just

SELECT * FROM foo WHERE tsv @@ plainto_tsquery('foo bar');

Denormalizing the table

This can take a lot of forms, you've got this right though..

Redesign

Searching everything in a fuzzy fashion like this is a losing game. Even this knock off of Dungeon and Dragons meets Yahoo Answers has rules.

It becomes a lot easier to generate a query when you introduce syntax likes [text] for tagging, and is:answer to search just answers, rather than rebuilding Google and normalized indexes.