PostgreSQL – Fast Hamming distance queries in Postgres

Tags: index, postgresql, postgresql-9.3

I have a large database (16M rows) containing perceptual hashes of images.

I'd like to be able to search for rows by hamming distance in a reasonable timeframe.

As far as I understand the issue, the best option here would be a custom SP-GiST index that implements a BK-tree, but that seems like a lot of work, and I'm still fuzzy on the practical details of properly implementing a custom index. Calculating the Hamming distance itself is tractable enough, and I do know C.

Basically, what is the appropriate approach here? I need to be able to query for matches within a certain edit distance of a hash. As I understand it, Levenshtein distance on strings of equal length is functionally Hamming distance, so there is at least some existing support for what I want, though no obvious way to build an index from it (remember, the value I'm querying for changes every time; I cannot precompute the distance from a fixed value, since that would only be useful for that one value).
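For reference, on two equal-length bit strings the Hamming distance is just the number of positions that differ, and on 64-bit integers that is the popcount of an XOR. A minimal sketch in Python (function names are mine, purely illustrative):

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit hashes (signed or unsigned)."""
    # Mask to 64 bits so the same function works on signed bigint-style values.
    return bin((a ^ b) & 0xFFFFFFFFFFFFFFFF).count("1")

def hamming_distance_str(a: str, b: str) -> int:
    """Same thing on the 64-char '0'/'1' strings: count positional mismatches."""
    assert len(a) == len(b)
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))
```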

The hashes are currently stored as 64-char strings containing the binary ASCII encoding of the hash (e.g. "10010101…"), but I can convert them to int64 easily enough. The real issue is that queries need to be reasonably fast.
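A small sketch of that conversion, assuming the values should land in a signed bigint column (the wrap into the signed range is the only non-obvious step):

```python
def binary_string_to_int64(bits: str) -> int:
    """Convert a 64-char '0'/'1' string into a value that fits a Postgres bigint."""
    value = int(bits, 2)      # unsigned interpretation, 0 .. 2**64 - 1
    if value >= 2**63:        # wrap into the signed 64-bit range
        value -= 2**64
    return value
```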

It seems like it might be possible to achieve something along these lines with the pg_trgm extension, but I'm a bit unclear on how the trigram-matching mechanism works (in particular, what does the similarity metric it returns actually represent? It looks kind of like edit distance).
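For what it's worth, pg_trgm's similarity is set-based rather than an edit distance: roughly the number of trigrams two strings share, divided by the number of distinct trigrams across both. A rough Python approximation of that idea (not the extension's exact tokenizer):

```python
def trigrams(s: str) -> set:
    # pg_trgm lowercases and pads each word with blanks before extracting
    # three-character windows; this is only a rough approximation of that.
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trgm_similarity(a: str, b: str) -> float:
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)   # shared / total distinct trigrams
```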

Insert performance is not critical (it's very computationally expensive to calculate the hashes for each row), so I primarily care about searching.

Best Answer

Well, I spent a while looking at writing a custom Postgres C extension, and wound up just writing a Cython database wrapper that maintains a BK-tree structure in memory.

Basically, it maintains an in-memory copy of the phash values from the database, and all updates to the database are replayed into the BK-tree.
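For anyone curious what that structure looks like, here is a minimal BK-tree sketch in plain Python, keyed on Hamming distance. The real implementation is Cython and lives in the repo linked below, so treat this purely as an illustration:

```python
class BKTree:
    """Minimal BK-tree over 64-bit hashes, keyed on Hamming distance."""

    def __init__(self):
        self.root = None  # each node is (hash_value, row_id, {edge_distance: child})

    @staticmethod
    def _distance(a: int, b: int) -> int:
        return bin((a ^ b) & 0xFFFFFFFFFFFFFFFF).count("1")

    def insert(self, value: int, row_id) -> None:
        """Add a hash; descend along the edge matching its distance to each node."""
        if self.root is None:
            self.root = (value, row_id, {})
            return
        node = self.root
        while True:
            dist = self._distance(value, node[0])
            child = node[2].get(dist)
            if child is None:
                node[2][dist] = (value, row_id, {})
                return
            node = child

    def query(self, target: int, max_distance: int):
        """Yield (distance, row_id) for every stored hash within max_distance."""
        if self.root is None:
            return
        stack = [self.root]
        while stack:
            value, row_id, children = stack.pop()
            dist = self._distance(target, value)
            if dist <= max_distance:
                yield dist, row_id
            # Triangle inequality: only subtrees whose edge distance lies in
            # [dist - max_distance, dist + max_distance] can contain matches.
            for edge, child in children.items():
                if dist - max_distance <= edge <= dist + max_distance:
                    stack.append(child)
```

insert() is what the replayed database updates feed into, and query() is what backs the distance-4 search described below.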

It's all up on GitHub here. It also has a LOT of unit tests.

Querying a dataset of 10 million hash values for items within a distance of 4 touches ~0.25%–0.5% of the values in the tree and takes ~100 ms.