Optimal Hash Technique for Indexing Large Text in PostgreSQL

Tags: bytea, index, optimization, postgresql

Erwin Brandstetter thankfully saved me from myself when I tried to create unique indexes on very long text columns.

The upper limit for the insertion rate is tens of billions of rows per year.

For my implementation, the hash never needs to leave the database, but the hashed data must frequently be compared against external data for existence.

In my limited experience with optimization for these purposes, I assume that the best data type for the hash would be bytea. The alternative is of course a much longer hex string.
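For a sense of scale, an illustrative query (assuming MD5 as the hash, which is what I plan to use): the digest stored as bytea takes 16 bytes, while its hex-string form takes 32 characters.

    -- Size of an MD5 digest as raw bytes vs. as a hex string.
    -- md5() returns a 32-character hex string; decode(..., 'hex') turns it into 16 raw bytes.
    SELECT octet_length(decode(md5('some large text'), 'hex')) AS bytea_bytes,  -- 16
           length(md5('some large text'))                      AS hex_chars;    -- 32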

1. Is it correct that bytea is the optimal data type for these hashes?
2. If bytea is not optimal, what is?
3. How should my intent be implemented?

Clarification

I'm using a hash of the text, per Erwin Brandstetter's recommendation, to ensure that large text values are unique. It is my limited understanding that raw binary data is generally the most performant option, especially compared with strings.

The hash only needs to be compared for existence to preempt uniqueness violations, so the hash happily never needs to leave the database. Because of libpqxx's incredible design, it appears the data can simply be passed in through a prepared statement and converted with decode(md5($1::text), 'hex'). Once I become more familiar with bytea insertion via libpqxx 3.1, I'll move that step into C++.
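A minimal sketch of what I have in mind, with placeholder table and column names rather than my real schema:

    -- Hypothetical table: the full text plus a bytea digest used only for uniqueness.
    CREATE TABLE big_text (
        id        bigserial PRIMARY KEY,
        body      text  NOT NULL,
        body_hash bytea NOT NULL UNIQUE   -- 16-byte MD5 digest of body
    );

    -- Insertion through a prepared statement; $1 is the raw text.
    -- The digest is computed inside the database, so it never has to leave it.
    PREPARE insert_text (text) AS
        INSERT INTO big_text (body, body_hash)
        VALUES ($1, decode(md5($1), 'hex'));

    -- Existence check used to preempt a uniqueness violation.
    PREPARE text_exists (text) AS
        SELECT EXISTS (
            SELECT 1
            FROM   big_text
            WHERE  body_hash = decode(md5($1), 'hex')
        );

The client only ever binds the raw text; the digest itself is computed server-side.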

For this implementation, collisions are acceptable because data can be reconstructed to conform without breaking the system.

If the maximum throughput is happily reached, then the economic resources to accommodate it should be available too; the primary concern will therefore always be performance. If my understanding of Postgres's capabilities is accurate, and its partitioned tables can handle row counts of this magnitude, it will hopefully be the right tool for the job for a long while. Luckily, data older than a few seconds will never change, and it is my understanding that Postgres partitioned tables can make minced meat out of such data. If not, this will be one of those good problems.
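For what it's worth, a sketch of the partitioned layout I have in mind, again with placeholder names; declarative range partitioning assumes PostgreSQL 10 or later, and an index created on the parent table assumes 11 or later:

    -- Partitioned variant of the hypothetical big_text table: one partition per
    -- month, so old, immutable data sits in partitions that are never written again.
    CREATE TABLE big_text (
        id         bigserial,
        created_at timestamptz NOT NULL DEFAULT now(),
        body       text  NOT NULL,
        body_hash  bytea NOT NULL
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE big_text_2024_01 PARTITION OF big_text
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    CREATE TABLE big_text_2024_02 PARTITION OF big_text
        FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

    -- Caveat: a unique index on a partitioned table must include the partition
    -- key, so uniqueness of the digest alone cannot be enforced globally.
    CREATE UNIQUE INDEX ON big_text (created_at, body_hash);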

Best Answer

bytea will be optimal for storing the hash.

It'll be transferred in and out of the database as a hex string anyway, unless you use PostgreSQL's binary wire protocol (supported by libpq, and partly by PgJDBC) to transfer it.

For best results, store as bytea and have the client application use a PQexecParams call that requests binary results.

Though on re-reading, this is confusing:

For my implementation, the hash never needs to leave the database, but the hashed data must frequently be compared against external data for existence

Do you mean that it's the original, unhashed text that gets transferred for comparison, not the hash? If so, the above is irrelevant, as the binary protocol offers no benefit for text-form data.

Also: "tens of billions" of rows is a lot. PostgreSQL has quite a large per-row overhead, around 28 bytes, so you're going to be losing a lot of space, especially once you factor in index overheads. Is PostgreSQL the right tool for this job?
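To make that concrete, a rough back-of-the-envelope estimate (figures are approximate and ignore the text column, indexes, alignment padding and fill factor):

    -- Per-value footprint of the stored digest: 16 bytes of data plus a
    -- 1-byte varlena header = 17 bytes.
    SELECT pg_column_size(decode(md5('example'), 'hex')) AS hash_col_bytes;  -- 17

    -- Rough table-only estimate for 10 billion rows:
    --   ~28 bytes row overhead + ~17 bytes digest ≈ 45 bytes per row
    --   45 bytes * 10 billion rows ≈ 450 GB before a single index is built.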

A final thought: With that many rows, you're getting up into hash-collision territory. Do you care if it's possible - though unlikely - for two different strings to have the same hash, so an incorrect unique violation is reported? If that's a problem then a unique b-tree index on the hash probably isn't the right tool for the job.
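If that is a problem, one option, sketched against the hypothetical schema from the question, is to let the index on the digest narrow the search and then compare the full text, so a collision is never mistaken for a genuine duplicate:

    -- Treat a row as a duplicate only if the full text matches, not just the
    -- 16-byte digest; the index on body_hash still does the heavy lifting, and
    -- the body comparison runs only for the few rows with a matching digest.
    PREPARE is_true_duplicate (text) AS
        SELECT EXISTS (
            SELECT 1
            FROM   big_text
            WHERE  body_hash = decode(md5($1), 'hex')
            AND    body = $1
        );

With a check like this in the application path, the index on the digest no longer needs to be unique.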