Database Design – Finding Closest Matching JSON Entry

database-design

Scenario: suppose I have a list of numbers written in something like JSON stored in a database field. Something like this:

ids{1,2,3}

Let's suppose the database is filled with rows that each have an ids JSON field like this:

entry1: someUniqueID, ids{1,2,4}

entry2: someUniqueID, ids{1,2,5}

entry3: someUniqueID, ids{1,2,3}

Is it possible with any type of database (SQL, NoSQL, graph, etc.) to structure the data for efficiently finding the entry with the closest match? As in: looking for the ids field inside the database that most closely resembles

ids{1,2,3}

entry1 has 2 matching numbers (1 and 2), therefore a 66.6% match

entry2 has 2 matching numbers (1 and 2), therefore a 66.6% match

entry3 has 3 matching numbers (1, 2 and 3), therefore a 100% match

The query would return entry3.
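To make the matching rule concrete, here is the naive in-application version of it in Python (a sketch; the entries dict and match_ratio helper are just illustrations, not part of any database):

```python
def match_ratio(candidate: set, query: set) -> float:
    """Fraction of the query's values that appear in the candidate set."""
    return len(candidate & query) / len(query)

# Sample data from the question, held in memory as sets.
entries = {
    "entry1": {1, 2, 4},
    "entry2": {1, 2, 5},
    "entry3": {1, 2, 3},
}
query = {1, 2, 3}

# Linear scan: pick the entry with the highest match ratio.
best = max(entries, key=lambda name: match_ratio(entries[name], query))
```

This linear scan is exactly what becomes too slow at millions of rows, which is what the question below is about.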

I know it's of course possible to read each row and compare the values in code to find the closest match. But with a large database holding many thousands or even millions of rows, each with its own ids field, getting a result that way would take ages.

Is there any database technology that would allow for fast, sub-second results of this kind, even if ids holds many more than 3 values?

By the way, the values don't have to be stored as JSON. Any direction for me to look into would be appreciated.

Best Answer

Just a rough sketch of how you could do that with Postgres arrays.

There are two array operators that you might be interested in: the "overlaps" operator (&&) that returns true if at least some of the elements are the same, and the "contains" operator (@>) that checks whether all elements of one array are contained in the other - that would be your 100% match.

Assuming this table:

create table idlist (ids int[]);

You can create an index on that column:

create index on idlist using gin (ids);

A query to find the 100% match looks like this:

select *
from idlist
where ids @> array[1,2,3,4,5,6];

That will return only rows where ids contains all the values 1,2,3,4,5,6 (though the array might contain more values than those).

On a table with 2 million rows and arrays of 5 to 100 elements, this takes about 20ms on a pretty old test server (the actual runtime depends on the total number of matches found).

The query to find rows that contain at least one of the values of the parameter uses the && operator and is substantially slower.

select *
from idlist
where ids && array[1,2,3,4,5,6];

That takes about 300ms on my test server - but it also returns 500,000 rows! If the condition returns only very few rows (fewer than a thousand), then obviously this is going to be a lot faster.

How you proceed from there is a matter of how restrictive your conditions are. If the && operator returns only a few rows (not half a million), then processing them in your application might be more efficient.

Getting the information whether the input parameter is completely contained could be done like this:

select ids, 
       ids @> array[1,2,3,4,5,6] as full_match
from idlist
where ids && array[1,2,3,4,5,6];

Another option would be to write your own function that calculates the match, or to use the intarray extension to efficiently calculate the intersection of two arrays.
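For example, something along these lines could rank candidates by match ratio directly in SQL (a sketch, assuming the intarray extension is installed - the & intersection operator and the icount() function come from that extension, and the divisor 3 is simply the length of the search array):

```sql
-- requires: create extension intarray;
-- rank rows by the fraction of the search values they contain
select ids,
       icount(ids & array[1,2,3])::numeric / 3 as match_ratio
from idlist
where ids && array[1,2,3]   -- GIN-indexable pre-filter
order by match_ratio desc
limit 1;
```

The && pre-filter lets the GIN index narrow the candidate set first; the exact ratio is then only computed for rows that share at least one value.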