How to store data with a query that’s approximated

database-design

I'm trying to find a way to store my data with fast access (better than O(n)).

My database consists of data (4096 byte strings) that represents some information about some items.
The problem is, that the query is never exact. I get one Item, and then need to find the closest match using a function F(a,b).

just an example:

1234
3456
6466
F(a,b) = return % of similar digits  

GetClosest(1233,F) = 1234

The problem is that F(a,b) is a complicated algorithm, (not a proper metric).

What I have now is just go over the whole database to search for the best match.
Is there a kind of tree or other cluster database type that can give me faster finding complexity ?

More information:

F gives back a similarity value in %percentage. where 100% is a perfect match.

Best Answer

You may want to try the retrieval (re trie val) tree, as known as the TRIE. Some refer to this as a Radix Tree.

The idea is to create tree node structures that contain branches for every character your data field can possibly contain.

Let's use a simple case, a numeric field. Obviously, the character range is 0-9. Each tree node would contain ten(10) branches. Let's take the worst case for a 4 byte unsigned integer, 2^32 - 1, which is 4294967295. What is its length? Just compute the length by take the integer of the log base 10 of 4294967295 and adding 1.

mysql> select floor(log10(power(2,32)) + 1);
+-------------------------------+
| floor(log10(power(2,32)) + 1) |
+-------------------------------+
|                            10 |
+-------------------------------+
1 row in set (0.00 sec)

So, you would have a TRIE with a maximum height of 10. Starting at the root of the TRIE, if you have the number 4294967295, you traverse branches 4,2,9,4,9,6,7,2,9, and 5. At each branch, you would perform an array-styled binary search.

If the branch located at that TRIE node is an exact match, you can assign a percentage to that level, and recursively walk down that branch to check for the next digit and return percentages from deeper TRIE nodes to add to the percentage you have at the currently searched TRIE node.

If the branch located at that TRIE node is NOT an exact match, you stop your recursive search there and return either 0 or some other percentage you may want to designate.

Given the sum of return values from all searched TRIE nodes, you may want to sum up the percentages and divide that answer by the length of the string. In other words,

Pct per Node = (1 / (Number of TRIE nodes that need to be searched)) or Zero(0).

Sum(Pct) = (Number of TRIE nodes exactly matched) / (Number of TRIE nodes that need to be searched [length of the string being searched]).

Given the length of the numberic field you store, you have O(log n) due to field length. For each TRIE node, you have O(log n) for searching for the proper branch. Overall, your search should have O(log (log n)) search time.

This performance stands out ever more if the field is alphanumeric. Assuming using only ASCII, each TRIE node would have 256 branches. The height of the TRIE would depend on the length of the character field. Representing this TRIE for variable-length strings would produce TRIE nodes that would be very sparse, but quickly searchable nonetheless.

Regardless what database you use, carefuly plan the data types you will be using to represent the TRIE node. You may also want to partition the table so that strings of length n terminate in partition n. Thus, you will have O(log n) search time at each partition.

http://en.wikipedia.org/wiki/Trie

http://www.eecs.harvard.edu/~ellard/Q-97/HTML/root/node24.html

http://www.webreference.com/js/tips/000318.html

http://en.wikipedia.org/wiki/Radix_tree

Related Solutions

How does Snapchat store its data

All good, future-proofed databases are normalised to some extent. Normalisation is described by Louis Davidson ~ Pro SQL Server 2008 Relational Database Design and Implementation as

The process of normalization is really quite straightforward: take entities that are complex and extract simpler entities from them. The process continues until every table in the database represents one thing and every column describes that thing.

So based on your example I would imagine that the Snap Chat database (assuming that we are talking about OLTP model and not using NOSQL / BIGData) would have multiple entities, with each entity containing relevant data and be related to other entities via the appropriate relationship i.e. 1..Many, Many to Many etc. For example, a person entity might contain UserName and Password and be related to a loginToSite entity via a 1..M relationship. I suggest a 1..M relationship because one person can log in to the site multiple times but each login to site record will only be related to one user. If all attributes were stored in the a single entity then in this example the UserName and Password information would be duplicated per Login record and break the rules of normalisation. This would also lead to a database usability nightmare!!

Depending on business rules, and these would be questions you would have to ask snap chat. One user record might be related to one or many photo records or maybe one user can be related to one or many photo records and one photo record can be related to one or many user records. In this case, there would be a many to many relationship.

When thinking about database design it's important to first work out the conceptual design by talking with end users, reviewing reports / spread-sheets (anything containing data used by the end user) and generating an ERD. Then once the relationships have been agreed, all data to be stored captured and the ERD modelled you can move onto the physical side of things and work on performance, choosing data types, deciding on indexes etc...

Database Design – Best Design to Reference Multiple Tables from Single Column

I personally don't like to use a multi-table schema for this purpose.

It's hard to ensure integrity.
It's hard to maintain.
It's difficult to filter results.

I've set a dbfiddle sample.

My proposed table schema:

CREATE TABLE #Brands
(
BrandId int NOT NULL PRIMARY KEY,
BrandName nvarchar(100) NOT NULL 
);

CREATE TABLE #Clothes
(
ClothesId int NOT NULL PRIMARY KEY,
ClothesName nvarchar(100) NOT NULL 
);

-- Lookup table for known attributes
--
CREATE TABLE #Attributes
(
AttrId int NOT NULL PRIMARY KEY,
AttrName nvarchar(100) NOT NULL 
);

-- holds common propeties, url, price, etc.
--
CREATE TABLE #BrandsClothes
(
BrandId int NOT NULL REFERENCES #Brands(BrandId),
ClothesId int NOT NULL REFERENCES #Clothes(ClothesId),
VievingUrl nvarchar(300) NOT NULL,
Price money NOT NULL,
PRIMARY KEY CLUSTERED (BrandId, ClothesId),
INDEX IX_BrandsClothes NONCLUSTERED (ClothesId, BrandId)
);

-- holds specific and unlimited attributes 
--
CREATE TABLE #BCAttributes
(
BrandId int NOT NULL REFERENCES #Brands(BrandId),
ClothesId int NOT NULL REFERENCES #Clothes(ClothesId),
AttrId int NOT NULL REFERENCES #Attributes(AttrId),
AttrValue nvarchar(300) NOT NULL,
PRIMARY KEY CLUSTERED (BrandId, ClothesId, AttrId),
INDEX IX_BCAttributes NONCLUSTERED (ClothesId, BrandId, AttrId)
);

Let me insert some data:

INSERT INTO #Brands VALUES 
(1, 'Brand1'), (2, 'Brand2');

INSERT INTO #Clothes VALUES 
(1, 'Pants'), (2, 'T-Shirt');

INSERT INTO #Attributes VALUES
(1, 'Color'), (2, 'Size'), (3, 'Shape'), (4, 'Provider'), (0, 'Custom');

INSERT INTO #BrandsClothes VALUES
(1, 1, 'http://mysite.com?B=1&C=1', 123.99),
(1, 2, 'http://mysite.com?B=1&C=2', 110.99),
(2, 1, 'http://mysite.com?B=2&C=1', 75.99),
(2, 2, 'http://mysite.com?B=2&C=2', 85.99);

INSERT INTO #BCAttributes VALUES
(1, 1, 1, 'Blue, Red, White'),
(1, 1, 2, '32, 33, 34'),
(1, 2, 1, 'Pearl, Black widow'),
(1, 2, 2, 'M, L, XL'),
(2, 1, 4, 'Levis, G-Star, Armani'),
(2, 1, 3, 'Slim fit, Regular fit, Custom fit'),
(2, 2, 4, 'G-Star, Armani'),
(2, 2, 3, 'Slim fit, Regular fit'),
(2, 2, 0, '15% Discount');

If you need to fetch common attributes:

SELECT     b.BrandName, c.ClothesName, bc.VievingUrl, bc.Price
FROM       #BrandsClothes bc
INNER JOIN #Brands b
ON         b.BrandId = bc.BrandId
INNER JOIN #Clothes c
ON         c.ClothesId = bc.ClothesId
ORDER BY   bc.BrandId, bc.ClothesId;

BrandName   ClothesName   VievingUrl                  Price
---------   -----------   -------------------------   ------
Brand1      Pants         http://mysite.com?B=1&C=1   123.99
Brand1      T-Shirt       http://mysite.com?B=1&C=2   110.99
Brand2      Pants         http://mysite.com?B=2&C=1    75.99
Brand2      T-Shirt       http://mysite.com?B=2&C=2    85.99

Or you can easily get Clothes by Brand:

Give me all clothes of Brand2

SELECT     c.ClothesName, b.BrandName, a.AttrName, bca.AttrValue
FROM       #BCAttributes bca
INNER JOIN #BrandsClothes bc
ON         bc.BrandId = bca.BrandId
AND        bc.ClothesId = bca.ClothesId
INNER JOIN #Brands b
ON         b.BrandId = bc.BrandId
INNER JOIN #Clothes c
ON         c.ClothesId = bc.ClothesId
INNER JOIN #Attributes a
ON         a.AttrId = bca.AttrId
WHERE      bca.ClothesId = 2
ORDER BY   bca.ClothesId, bca.BrandId, bca.AttrId;

ClothesName   BrandName   AttrName   AttrValue
-----------   ---------   --------   ---------------------
T-Shirt       Brand1      Color      Pearl, Black widow
T-Shirt       Brand1      Size       M, L, XL
T-Shirt       Brand2      Custom     15% Discount
T-Shirt       Brand2      Shape      Slim fit, Regular fit
T-Shirt       Brand2      Provider   G-Star, Armani

But for me, one of the best of this schema is that you can filter by Attibutes:

Give me all Clothes that has the attribute: Size

SELECT     c.ClothesName, b.BrandName, a.AttrName, bca.AttrValue
FROM       #BCAttributes bca
INNER JOIN #BrandsClothes bc
ON         bc.BrandId = bca.BrandId
AND        bc.ClothesId = bca.ClothesId
INNER JOIN #Brands b
ON         b.BrandId = bc.BrandId
INNER JOIN #Clothes c
ON         c.ClothesId = bc.ClothesId
INNER JOIN #Attributes a
ON         a.AttrId = bca.AttrId
WHERE      bca.AttrId = 2
ORDER BY   bca.ClothesId, bca.BrandId, bca.AttrId;

ClothesName   BrandName   AttrName   AttrValue
-----------   ---------   --------   ----------
Pants         Brand1      Size       32, 33, 34
T-Shirt       Brand1      Size       M, L, XL

Using a multi-table schema whatever of the previous queries will require to deal with an unlimited number of tables, or with XML or JSON fields.

Another option with this schema, is that you can define templates, for example, you could add a new table BrandAttrTemplates. Every time you add a new record you can use a trigger or a SP to generate a set of a predefined attributes for this Branch.

I'm sorry, I'd like to extend my explanations by I think it is more clear than my English.

Update

My current answer should works on no matter which RDBMS. According to your comments, if you need to filter attributes values I'd suggest small changes.

As far as MS-Sql doesn't allow arrays, I've set up a new sample mantaining same table schema, but changing AttrValue to an ARRAY field type.

In fact, using POSTGRES, you can take advantatge of this array using a GIN index.

(Let me say that @EvanCarrol has a good knowledge about Postgres, certainly better than me. But let me add my bit.)

CREATE TABLE BCAttributes
(
BrandId int NOT NULL REFERENCES Brands(BrandId),
ClothesId int NOT NULL REFERENCES Clothes(ClothesId),
AttrId int NOT NULL REFERENCES Attrib(AttrId),
AttrValue text[],
PRIMARY KEY (BrandId, ClothesId, AttrId)
);

CREATE INDEX ix_attributes on BCAttributes(ClothesId, BrandId, AttrId);
CREATE INDEX ix_gin_attributes on BCAttributes using GIN (AttrValue);


INSERT INTO BCAttributes VALUES
(1, 1, 1, '{Blue, Red, White}'),
(1, 1, 2, '{32, 33, 34}'),
(1, 2, 1, '{Pearl, Black widow}'),
(1, 2, 2, '{M, L, XL}'),
(2, 1, 4, '{Levis, G-Star, Armani}'),
(2, 1, 3, '{Slim fit, Regular fit, Custom fit}'),
(2, 2, 4, '{G-Star, Armani}'),
(2, 2, 3, '{Slim fit, Regular fit}'),
(2, 2, 0, '{15% Discount}');

Now, you can additionally query using individual attributes values like:

Give me a list of all pants Size:33

AttribId = 2 AND ARRAY['33'] && bca.AttrValue

SELECT     c.ClothesName, b.BrandName, a.AttrName, array_to_string(bca.AttrValue, ', ')
FROM       BCAttributes bca
INNER JOIN BrandsClothes bc
ON         bc.BrandId = bca.BrandId
AND        bc.ClothesId = bca.ClothesId
INNER JOIN Brands b
ON         b.BrandId = bc.BrandId
INNER JOIN Clothes c
ON         c.ClothesId = bc.ClothesId
INNER JOIN Attrib a
ON         a.AttrId = bca.AttrId
WHERE      bca.AttrId = 2
AND        ARRAY['33'] && bca.AttrValue
ORDER BY   bca.ClothesId, bca.BrandId, bca.AttrId;

This is the result:

clothes name | brand name | attribute | values 
------------- ------------ ----------  ---------------- 
Pants          Brand1       Size        32, 33, 34

Best Answer

Related Solutions

How does Snapchat store its data

Database Design – Best Design to Reference Multiple Tables from Single Column

Related Question