How to Store and Index High-Dimensional Vectors for Fast Lookup by Cosine Similarity

array, database-design, dimension, index, postgresql

I am trying to store vectors for word/doc embeddings in a PostgreSQL table, and want to be able to quickly pull the N rows with the highest cosine similarity to a given query vector. The vectors I'm working with are NumPy arrays of floats with length 100 <= L <= 1000.

I looked into the cube module for similarity search, but it is limited to vectors with at most 100 dimensions. The embeddings I am using will have at least 100 dimensions and often many more (depending on the settings used when training word2vec/doc2vec models).

What is the most efficient way to store large dimensional vectors (numpy float arrays) in Postgres, and perform quick lookup based on cosine similarity (or other vector similarity metrics)?
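For context, one straightforward storage option (which solves storage but not indexing) is a plain `float8[]` column, with similarity computed client-side. The metric itself is simple; a minimal NumPy sketch, independent of any particular schema:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D float vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([2.0, 0.0, 2.0])   # same direction as v1 -> similarity 1
v3 = np.array([0.0, 1.0, 0.0])   # orthogonal to v1    -> similarity 0
```

The question is how to get the database to answer top-N queries on this metric without evaluating it row by row in the client.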

Best Answer

I'm not aware of any existing modules that will do this for you. It will be hard to index this in the traditional sense of jumping to a small, specific part of the index that covers all possibly-qualifying rows.

Will the vectors be centered? Normalized? Could you get away with representing them as integers with an implicit divisor, rather than true floats?

Your best hope may be to implement an index somewhat like the bloom extension: one that always scans the full "index", but does so in a way that is much faster than scanning the table, both because the index is smaller and because its data is laid out so it can bypass the type-abstraction machinery and use the CPU more directly.
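Before building such an extension, it's worth prototyping the full-scan strategy outside the database to see what a brute-force pass over tightly packed data costs. A sketch using NumPy (the function name and random data are illustrative): all rows are scanned in one matrix-vector product, and `argpartition` picks the top N without sorting everything.

```python
import numpy as np

def top_n_cosine(matrix: np.ndarray, query: np.ndarray, n: int) -> np.ndarray:
    """Row indices of the n rows most cosine-similar to `query`.

    `matrix` is (rows, dims). A single matrix-vector product touches
    every row -- the "always scan the full index" strategy above.
    """
    # Normalize rows and the query so the dot product equals the cosine.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = (matrix / norms) @ (query / np.linalg.norm(query))
    # argpartition finds the n largest in O(rows); sort only those n.
    top = np.argpartition(-sims, n - 1)[:n]
    return top[np.argsort(-sims[top])]

rng = np.random.default_rng(42)
data = rng.normal(size=(1000, 200))
q = data[7] + 0.01 * rng.normal(size=200)  # near-duplicate of row 7
best = top_n_cosine(data, q, 5)
```

On dense float data like this, a brute-force scan of tens or hundreds of thousands of rows is often fast enough that an approximate index is unnecessary; measuring first tells you whether the extension is worth writing.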