How to Store and Index High-Dimensional Vectors for Fast Lookup by Cosine Similarity

array, database-design, dimension, index, postgresql

I am trying to store vectors for word/doc embeddings in a PostgreSQL table, and want to be able to quickly pull the N rows with the highest cosine similarity to a given query vector. The vectors I'm working with are NumPy arrays of floats with length 100 <= L <= 1000.

I looked into the cube module for similarity search, but it is limited to vectors with at most 100 dimensions. The embeddings I am using will have at least 100 dimensions and often many more (depending on the settings used when training word2vec/doc2vec models).

What is the most efficient way to store large dimensional vectors (numpy float arrays) in Postgres, and perform quick lookup based on cosine similarity (or other vector similarity metrics)?
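For context, one straightforward storage option (which solves storage but not indexing) is a plain `float8[]` column, with similarity computed client-side. The metric itself is simple; a minimal NumPy sketch, independent of any particular schema:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D float vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([2.0, 0.0, 2.0])   # same direction as v1 -> similarity 1
v3 = np.array([0.0, 1.0, 0.0])   # orthogonal to v1    -> similarity 0
```

The question is how to get the database to answer top-N queries on this metric without evaluating it row by row in the client.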

Best Answer

I'm not aware of any existing modules that will do this for you. It will be hard to index this in the traditional sense of jumping to a small, specific part of the index that covers all possibly-qualifying rows.

Will the vectors be centered? Normalized? Could you get away with representing them as integers with an implicit divisor, rather than true floats?

Your best hope may be to implement an index somewhat like the bloom extension: one that always scans the full "index", but does so in a way that is much faster than scanning the table, both because the index is smaller and because its data is laid out so it can bypass the type-abstraction machinery and use the CPU more directly.
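Before building such an extension, it's worth prototyping the full-scan strategy outside the database to see what a brute-force pass over tightly packed data costs. A sketch using NumPy (the function name and random data are illustrative): all rows are scanned in one matrix-vector product, and `argpartition` picks the top N without sorting everything.

```python
import numpy as np

def top_n_cosine(matrix: np.ndarray, query: np.ndarray, n: int) -> np.ndarray:
    """Row indices of the n rows most cosine-similar to `query`.

    `matrix` is (rows, dims). A single matrix-vector product touches
    every row -- the "always scan the full index" strategy above.
    """
    # Normalize rows and the query so the dot product equals the cosine.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = (matrix / norms) @ (query / np.linalg.norm(query))
    # argpartition finds the n largest in O(rows); sort only those n.
    top = np.argpartition(-sims, n - 1)[:n]
    return top[np.argsort(-sims[top])]

rng = np.random.default_rng(42)
data = rng.normal(size=(1000, 200))
q = data[7] + 0.01 * rng.normal(size=200)  # near-duplicate of row 7
best = top_n_cosine(data, q, 5)
```

On dense float data like this, a brute-force scan of tens or hundreds of thousands of rows is often fast enough that an approximate index is unnecessary; measuring first tells you whether the extension is worth writing.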