Scenario: suppose I have a list of numbers written in something like JSON stored in a database field. Something like this:
ids{1,2,3}
Let's suppose the database is filled with rows that each have an IDs JSON field like this:
entry1: someUniqueID, ids{1,2,4}
entry2: someUniqueID, ids{1,2,5}
entry3: someUniqueID, ids{1,2,3}
Is it possible, with any type of database (SQL, NoSQL, graph, etc.), to structure the data so that the entry with the closest match can be found efficiently? As in: looking for the ids field in the database that most closely resembles
ids{1,2,3}
entry1 has 2 matching numbers inside (1 and 2) and is therefore a 66.6% match
entry2 has 2 matching numbers inside (1 and 2) and is therefore a 66.6% match
entry3 has 3 matching numbers inside (1, 2 and 3) and is therefore a 100% match
The query would return entry3.
I know it's of course possible to read each row and compare the values in code to find the closest match. But with a large database holding many thousands or even millions of rows, each with its own ids field, getting a result that way would take ages.
Is there any database technology that would allow fast, sub-second results for this kind of query, even if ids holds many more than 3 values?
By the way, the values don't have to be stored using JSON. Any direction for me to look into would be appreciated.
Best Answer
Just a rough sketch of how you could do that with Postgres arrays.
There are two array operators that you might be interested in: the "overlaps" operator &&, which returns true if some of the elements are the same, and the "contains" operator @>, which checks whether all elements of one array are contained in the other - that would be your 100% match.
Assuming this table:
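A minimal sketch of such a table (the table name entries and the column names are my assumptions, following the question's ids field):

```sql
create table entries
(
  id  serial primary key,
  ids integer[]   -- the list of numbers, stored as a native Postgres array
);
```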
You can create an index on that column:
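A GIN index supports both the contains and overlaps operators on array columns (assuming the entries table sketched above):

```sql
-- GIN indexes array elements, so @> and && lookups can use it
create index idx_entries_ids on entries using gin (ids);
```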
A query to find the 100% match looks like this:
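For example, using the contains operator @> with the sample values (table name assumed as above):

```sql
-- rows whose ids array contains every one of the listed values
select id, ids
from entries
where ids @> array[1,2,3,4,5,6];
```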
That will return only rows where ids contains all of the values 1,2,3,4,5,6 (but it might contain more). On a table with 2 million rows and arrays of 5 to 100 elements this takes about 20ms on a pretty old test server (the actual runtime depends on the total number of matches found).
The query to find rows that contain at least one of the values of the parameter uses the && operator and is substantially slower. That takes about 300ms on my test server - but it also returns 500,000 rows! If the condition returns only very few rows (fewer than a thousand), then obviously this is going to be a lot faster.
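The overlaps query itself would look like this (again assuming the entries table from the sketch above):

```sql
-- rows whose ids array shares at least one element with the parameter
select id, ids
from entries
where ids && array[1,2,3,4,5,6];
```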
How you proceed from there depends on how restrictive the conditions are that you use. If the && operator returns only a few rows (not half a million), then processing them in your application might be more efficient. Getting the information whether the input parameter is completely contained could be done like this:
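One way to sketch that: fetch the overlapping rows together with a boolean flag telling whether the parameter array is fully contained (table name assumed as above):

```sql
-- full_match is true only for the 100% matches among the overlapping rows
select id,
       ids,
       ids @> array[1,2,3,4,5,6] as full_match
from entries
where ids && array[1,2,3,4,5,6];
```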
Another option would be to write your own function that calculates the match, or maybe use the intarray extension to efficiently calculate the intersection between two arrays.
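As a sketch of that last idea (assuming the intarray extension is installed and the entries table from above): intarray overloads & to compute the intersection of two integer arrays, and icount() returns the number of elements, so the match percentage from the question could be computed and sorted on directly:

```sql
create extension if not exists intarray;

-- match_pct follows the question's definition: matching values / searched values
select id,
       ids,
       icount(ids & array[1,2,3]) * 100.0 / icount(array[1,2,3]) as match_pct
from entries
where ids && array[1,2,3]   -- only consider rows with at least one match
order by match_pct desc
limit 1;
```

Note that this still has to rank every overlapping row; if the && condition matches a large share of the table, the ordering step dominates the runtime.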