Mysql – 1 million strings and exact match search

database-designmongodbMySQLPHP

I need to chose the best solution (software and algorithm) to solve my problem. I need to store up to million strings and data associated with them. Strings are search queries and data is cached search result. So the operations I'm going to perform are:

get a query from user
look for cached result in DB (exact match)
if found – return it
if not – perform a search and store the result

Some other facts:

As it is cache rows will expire over time.
Lookup and insert speed is a priority.
This string field is most probably the only one to search by.
I'm using PHP

Looks like hashing the string is a good idea, collisions are not a problem, if I get it I will just assume it's a cache miss and perform a new search. So I'm thinking of the following options:

MySQL with hash stored in binary field
MySQL with hash as string with index
MongoDB (maybe with capped collection instead of expiration time field)
some other solution (specialized key-value storage or something) I don't have experience with if it's good enough and worth learning for the project

I also have used Sphinx search when I needed almost the same thing but with partial match also. For exact search it seems to be an overkill.

So is hashing a good idea at all?
Which hash algorithm would you recommend?
Which option if better and why?
Does the answer change if I need to store 5 million rows?
What if I add hits field to get most popular rows from time to time?

Best Answer

PostgreSQL is perfectly fine for this. You got a couple of options to make this work.

First of all: PostgreSQL has a special index type for that called GIN (http://www.cybertec.at/gin-just-an-index-type/). It is perfect for full-text-search in general.

The cool thing is: In the latest version of PostgreSQL there is support for a thing called jsonb. You can put your string into a JSON document and search on EACH field in the JSON in a nice way using GIN and a couple of cool operators (see http://www.postgresql.org/docs/9.4/static/datatype-json.html). JSONB is really really fast, very powerful and it definitely kills MongoDB on the performance side. In addition to that a couple million of rows are no real big deal for PostgreSQL anyway. There is one more cool thing: PostgreSQL can do fuzzy search on strings (through Nearest-Neighbour search).

Related Solutions

How to store data with a query that’s approximated

You may want to try the retrieval (re trie val) tree, as known as the TRIE. Some refer to this as a Radix Tree.

The idea is to create tree node structures that contain branches for every character your data field can possibly contain.

Let's use a simple case, a numeric field. Obviously, the character range is 0-9. Each tree node would contain ten(10) branches. Let's take the worst case for a 4 byte unsigned integer, 2^32 - 1, which is 4294967295. What is its length? Just compute the length by take the integer of the log base 10 of 4294967295 and adding 1.

mysql> select floor(log10(power(2,32)) + 1);
+-------------------------------+
| floor(log10(power(2,32)) + 1) |
+-------------------------------+
|                            10 |
+-------------------------------+
1 row in set (0.00 sec)

So, you would have a TRIE with a maximum height of 10. Starting at the root of the TRIE, if you have the number 4294967295, you traverse branches 4,2,9,4,9,6,7,2,9, and 5. At each branch, you would perform an array-styled binary search.

If the branch located at that TRIE node is an exact match, you can assign a percentage to that level, and recursively walk down that branch to check for the next digit and return percentages from deeper TRIE nodes to add to the percentage you have at the currently searched TRIE node.

If the branch located at that TRIE node is NOT an exact match, you stop your recursive search there and return either 0 or some other percentage you may want to designate.

Given the sum of return values from all searched TRIE nodes, you may want to sum up the percentages and divide that answer by the length of the string. In other words,

Pct per Node = (1 / (Number of TRIE nodes that need to be searched)) or Zero(0).

Sum(Pct) = (Number of TRIE nodes exactly matched) / (Number of TRIE nodes that need to be searched [length of the string being searched]).

Given the length of the numberic field you store, you have O(log n) due to field length. For each TRIE node, you have O(log n) for searching for the proper branch. Overall, your search should have O(log (log n)) search time.

This performance stands out ever more if the field is alphanumeric. Assuming using only ASCII, each TRIE node would have 256 branches. The height of the TRIE would depend on the length of the character field. Representing this TRIE for variable-length strings would produce TRIE nodes that would be very sparse, but quickly searchable nonetheless.

Regardless what database you use, carefuly plan the data types you will be using to represent the TRIE node. You may also want to partition the table so that strings of length n terminate in partition n. Thus, you will have O(log n) search time at each partition.

http://en.wikipedia.org/wiki/Trie

http://www.eecs.harvard.edu/~ellard/Q-97/HTML/root/node24.html

http://www.webreference.com/js/tips/000318.html

http://en.wikipedia.org/wiki/Radix_tree

Mysql – Recommended structure? %70 write %30 read. 10M row. 200 query/sec. Select, update, insert, search

Switch to InnoDB, and

Set innodb_buffer_pool_size to at least 250M
Show us the tentative SELECTs
SHOW CREATE TABLE (DESC is inadequate)

Best Answer

Related Solutions

How to store data with a query that’s approximated

Mysql – Recommended structure? %70 write %30 read. 10M row. 200 query/sec. Select, update, insert, search

Related Question