PostgreSQL – Fast Hamming distance queries in Postgres

Tags: index, postgresql, postgresql-9.3

I have a large database (16M rows) containing perceptual hashes of images.

I'd like to be able to search for rows by hamming distance in a reasonable timeframe.

As far as I understand the issue, the best option here would be a custom SP-GiST index that implements a BK-tree, but that seems like a lot of work, and I'm still fuzzy on the practical details of properly implementing a custom index. Calculating the Hamming distance itself is tractable enough, and I do know C.

Basically, what is the appropriate approach here? I need to be able to query for matches within a certain edit distance of a hash. As I understand it, Levenshtein distance on strings of equal length is functionally Hamming distance, so there is at least some existing support for what I want, though no obvious way to build an index from it (remember, the value I'm querying for changes every time; I cannot precompute the distance from a fixed value, since that would only be useful for that one value).
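For reference, on two equal-length bit strings the Hamming distance is just the number of positions that differ, and on 64-bit integers that is the popcount of an XOR. A minimal sketch in Python (function names are mine, purely illustrative):

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit hashes (signed or unsigned)."""
    # Mask to 64 bits so the same function works on signed bigint-style values.
    return bin((a ^ b) & 0xFFFFFFFFFFFFFFFF).count("1")

def hamming_distance_str(a: str, b: str) -> int:
    """Same thing on the 64-char '0'/'1' strings: count positional mismatches."""
    assert len(a) == len(b)
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))
```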

The hashes are currently stored as 64-char strings containing the binary ASCII encoding of the hash (e.g. "10010101…"), but I can convert them to int64 easily enough. The real issue is that queries need to be reasonably fast.
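A small sketch of that conversion, assuming the values should land in a signed bigint column (the wrap into the signed range is the only non-obvious step):

```python
def binary_string_to_int64(bits: str) -> int:
    """Convert a 64-char '0'/'1' string into a value that fits a Postgres bigint."""
    value = int(bits, 2)      # unsigned interpretation, 0 .. 2**64 - 1
    if value >= 2**63:        # wrap into the signed 64-bit range
        value -= 2**64
    return value
```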

It seems like it might be possible to achieve something along these lines with the pg_trgm extension, but I'm a bit unclear on how the trigram-matching mechanism works (in particular, what does the similarity metric it returns actually represent? It looks kind of like edit distance).
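For what it's worth, pg_trgm's similarity is set-based rather than an edit distance: roughly the number of trigrams two strings share, divided by the number of distinct trigrams across both. A rough Python approximation of that idea (not the extension's exact tokenizer):

```python
def trigrams(s: str) -> set:
    # pg_trgm lowercases and pads each word with blanks before extracting
    # three-character windows; this is only a rough approximation of that.
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trgm_similarity(a: str, b: str) -> float:
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)   # shared / total distinct trigrams
```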

Insert performance is not critical (it's very computationally expensive to calculate the hashes for each row), so I primarily care about searching.

Best Answer

Well, I spent a while looking at writing a custom Postgres C extension, and wound up just writing a Cython database wrapper that maintains a BK-tree structure in memory.

Basically, it maintains an in-memory copy of the phash values from the database, and all updates to the database are replayed into the BK-tree.
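For anyone curious what that structure looks like, here is a minimal BK-tree sketch in plain Python, keyed on Hamming distance. The real implementation is Cython and lives in the repo linked below, so treat this purely as an illustration:

```python
class BKTree:
    """Minimal BK-tree over 64-bit hashes, keyed on Hamming distance."""

    def __init__(self):
        self.root = None  # each node is (hash_value, row_id, {edge_distance: child})

    @staticmethod
    def _distance(a: int, b: int) -> int:
        return bin((a ^ b) & 0xFFFFFFFFFFFFFFFF).count("1")

    def insert(self, value: int, row_id) -> None:
        """Add a hash; descend along the edge matching its distance to each node."""
        if self.root is None:
            self.root = (value, row_id, {})
            return
        node = self.root
        while True:
            dist = self._distance(value, node[0])
            child = node[2].get(dist)
            if child is None:
                node[2][dist] = (value, row_id, {})
                return
            node = child

    def query(self, target: int, max_distance: int):
        """Yield (distance, row_id) for every stored hash within max_distance."""
        if self.root is None:
            return
        stack = [self.root]
        while stack:
            value, row_id, children = stack.pop()
            dist = self._distance(target, value)
            if dist <= max_distance:
                yield dist, row_id
            # Triangle inequality: only subtrees whose edge distance lies in
            # [dist - max_distance, dist + max_distance] can contain matches.
            for edge, child in children.items():
                if dist - max_distance <= edge <= dist + max_distance:
                    stack.append(child)
```

insert() is what the replayed database updates feed into, and query() is what backs the distance-4 search described below.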

It's all up on GitHub here. It also has a LOT of unit tests.

Querying a dataset of 10 million hash values for items within a distance of 4 touches ~0.25%–0.5% of the values in the tree and takes ~100 ms.