Scenario: suppose I have a list of numbers written in something like JSON stored in a database field. Something like this:
ids{1,2,3}
Let's suppose the database is filled with rows that each have an IDs JSON field like this:
entry1: someUniqueID, ids{1,2,4}
entry2: someUniqueID, ids{1,2,5}
entry3: someUniqueID, ids{1,2,3}
Is it possible, with any type of database (SQL, NoSQL, graph, etc.), to structure the data so that the entry with the closest match can be found efficiently? As in: looking for the ids field in the database that most closely resembles
ids{1,2,3}
entry1 has 2 matching numbers inside (1 and 2) and is therefore a 66.6% match
entry2 has 2 matching numbers inside (1 and 2) and is therefore a 66.6% match
entry3 has 3 matching numbers inside (1, 2 and 3) and is therefore a 100% match
The query would return entry3.
I know it's of course possible to read each row and compare the values in code to find the closest match. But with a large database holding many thousands or even millions of rows, each with its own ids field, getting a result that way would take ages.
Is there any database technology that would allow fast, sub-second results for this kind of query, even if ids holds many more than 3 values?
By the way, the values don't have to be stored using JSON. Any direction for me to look into would be appreciated.
Best Answer
Just a rough sketch of how you could do that with Postgres arrays.
There are two array operators that you might be interested in: the "overlaps" operator &&, which returns true if some of the elements are the same, and the "contains" operator @>, which checks whether all elements of one array are contained in the other - that would be your 100% match.
Assuming this table:
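A minimal sketch of such a table (the table name entries and the column names are my assumptions, following the question's ids field):

```sql
create table entries
(
  id  serial primary key,
  ids integer[]   -- the list of numbers, stored as a native Postgres array
);
```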
You can create an index on that column:
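A GIN index supports both the contains and overlaps operators on array columns (assuming the entries table sketched above):

```sql
-- GIN indexes array elements, so @> and && lookups can use it
create index idx_entries_ids on entries using gin (ids);
```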
A query to find the 100% match looks like this:
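For example, using the contains operator @> with the sample values (table name assumed as above):

```sql
-- rows whose ids array contains every one of the listed values
select id, ids
from entries
where ids @> array[1,2,3,4,5,6];
```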
That will return only rows where ids contains all of the values 1,2,3,4,5,6 (but it might contain more). On a table with 2 million rows and arrays of 5 to 100 elements this takes about 20ms on a pretty old test server (the actual runtime depends on the total number of matches found).
The query to find rows that contain at least one of the values of the parameter uses the && operator and is substantially slower. That takes about 300ms on my test server - but it also returns 500,000 rows! If the condition returns only very few rows (fewer than a thousand), then obviously this is going to be a lot faster.
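The overlaps query itself would look like this (again assuming the entries table from the sketch above):

```sql
-- rows whose ids array shares at least one element with the parameter
select id, ids
from entries
where ids && array[1,2,3,4,5,6];
```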
How you proceed from there depends on how restrictive the conditions are that you use. If the && operator returns only a few rows (not half a million), then processing them in your application might be more efficient. Getting the information whether the input parameter is completely contained could be done like this:
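One way to sketch that: fetch the overlapping rows together with a boolean flag telling whether the parameter array is fully contained (table name assumed as above):

```sql
-- full_match is true only for the 100% matches among the overlapping rows
select id,
       ids,
       ids @> array[1,2,3,4,5,6] as full_match
from entries
where ids && array[1,2,3,4,5,6];
```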
Another option would be to write your own function that calculates the match, or maybe use the intarray extension to efficiently calculate the intersection between two arrays.
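As a sketch of that last idea (assuming the intarray extension is installed and the entries table from above): intarray overloads & to compute the intersection of two integer arrays, and icount() returns the number of elements, so the match percentage from the question could be computed and sorted on directly:

```sql
create extension if not exists intarray;

-- match_pct follows the question's definition: matching values / searched values
select id,
       ids,
       icount(ids & array[1,2,3]) * 100.0 / icount(array[1,2,3]) as match_pct
from entries
where ids && array[1,2,3]   -- only consider rows with at least one match
order by match_pct desc
limit 1;
```

Note that this still has to rank every overlapping row; if the && condition matches a large share of the table, the ordering step dominates the runtime.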