PostgreSQL storing vectors and computing dot product

postgresql

I am working on a small NLP project in Python that involves performing text similarity on documents(news articles, blog posts etc) and I am thinking of storing each document features vector in Postgres, for instance say I have a document with the content Breakthrough on cancer research..., assuming after all the Python preprocessing it would become something like [(1, 0.225), (2, 0.1), (5, 0.11),...] where the first item in the tuple represents the index of the "feature" (in this case a word), and the second item represents its "score/magnitude".

Now if I have a second document with a similar data structure, say query_vector = [(1, 0.525), (6, 0.9), (7, 0.1221),...] I need to run a query like SELECT * FROM vectors_table WHERE dot_product(vectors_table.features_vector, query_vector) > 0.5. What dot_product will do is multiple the "score/magnitude" of both vectors, which in the example above its gonna be SUM(0.525*0.225, 0.9*0, 0.1221*0, ...) (0 because the other vector don't have any score for the feature with index 6 or 7).

I have a working solution but it is way too slow. My most straightforward solution is to store the vectors as jsonb so it'll become something like {1: 0.225, 2: 0.1, 5:0.11, ...} and I also have just the keys on the jsonb in another column (for faster array intersection computation later on) then I have an UDF as such:

CREATE OR REPLACE FUNCTION dot_product_overlap(query jsonb, query_keys anyarray, target jsonb, target_keys anyarray)
RETURNS float AS $$
DECLARE 
    row integer;
    result float := 0;
BEGIN
       FOR row IN SELECT * FROM array_intersect(query_keys, target_keys)
       LOOP
          result:=result+((query->>row)::float*(target->>row)::float);
       END LOOP;
    RETURN result;
END;
$$  LANGUAGE plpgsql;

WHERE array_intersect is another UDF to compute the intersection between 2 array (something like & in intarray). What this UDF did was first compute the intersection between two array of keys and then only compute the multiplication if both vectors has the same key and lastly sum it all up.

This actually works perfectly, just that its extremely slow, running a query against a table with 5000 documents takes ~5s. Now I was wondering is there a more efficient way to store these feature vectors and performing the dot product? I researched a little bit and stumbled upon MADlib cosine_similarity function (which is essentially dot product in this case) but AFAIK it doesn't fit the structure of my feature vectors.

Best Answer

You should write the function in C if you want it to run as fast as possible. Then it might also be a good idea to write your own data type that is stored with a hash table.

If that is too hard, and you don't need that last little bit of performance, you could try to rewrite your function in Perl or Python — both can be run inside the database.