How to store data with a query that’s approximated

database-design

I'm trying to find a way to store my data with fast access (better than O(n)).

My database consists of data (4096 byte strings) that represents some information about some items.
The problem is, that the query is never exact. I get one Item, and then need to find the closest match using a function F(a,b).

just an example:

1234
3456
6466
F(a,b) = return % of similar digits  

GetClosest(1233,F) = 1234

The problem is that F(a,b) is a complicated algorithm, (not a proper metric).

What I have now is just go over the whole database to search for the best match.
Is there a kind of tree or other cluster database type that can give me faster finding complexity ?

More information:

F gives back a similarity value in %percentage. where 100% is a perfect match.

Best Answer

You may want to try the retrieval (re trie val) tree, as known as the TRIE. Some refer to this as a Radix Tree.

The idea is to create tree node structures that contain branches for every character your data field can possibly contain.

Let's use a simple case, a numeric field. Obviously, the character range is 0-9. Each tree node would contain ten(10) branches. Let's take the worst case for a 4 byte unsigned integer, 2^32 - 1, which is 4294967295. What is its length? Just compute the length by take the integer of the log base 10 of 4294967295 and adding 1.

mysql> select floor(log10(power(2,32)) + 1);
+-------------------------------+
| floor(log10(power(2,32)) + 1) |
+-------------------------------+
|                            10 |
+-------------------------------+
1 row in set (0.00 sec)

So, you would have a TRIE with a maximum height of 10. Starting at the root of the TRIE, if you have the number 4294967295, you traverse branches 4,2,9,4,9,6,7,2,9, and 5. At each branch, you would perform an array-styled binary search.

If the branch located at that TRIE node is an exact match, you can assign a percentage to that level, and recursively walk down that branch to check for the next digit and return percentages from deeper TRIE nodes to add to the percentage you have at the currently searched TRIE node.

If the branch located at that TRIE node is NOT an exact match, you stop your recursive search there and return either 0 or some other percentage you may want to designate.

Given the sum of return values from all searched TRIE nodes, you may want to sum up the percentages and divide that answer by the length of the string. In other words,

Pct per Node = (1 / (Number of TRIE nodes that need to be searched)) or Zero(0).

Sum(Pct) = (Number of TRIE nodes exactly matched) / (Number of TRIE nodes that need to be searched [length of the string being searched]).

Given the length of the numberic field you store, you have O(log n) due to field length. For each TRIE node, you have O(log n) for searching for the proper branch. Overall, your search should have O(log (log n)) search time.

This performance stands out ever more if the field is alphanumeric. Assuming using only ASCII, each TRIE node would have 256 branches. The height of the TRIE would depend on the length of the character field. Representing this TRIE for variable-length strings would produce TRIE nodes that would be very sparse, but quickly searchable nonetheless.

Regardless what database you use, carefuly plan the data types you will be using to represent the TRIE node. You may also want to partition the table so that strings of length n terminate in partition n. Thus, you will have O(log n) search time at each partition.

http://en.wikipedia.org/wiki/Trie

http://www.eecs.harvard.edu/~ellard/Q-97/HTML/root/node24.html

http://www.webreference.com/js/tips/000318.html

http://en.wikipedia.org/wiki/Radix_tree