Mysql – 1 million strings and exact match search

database-designmongodbMySQLPHP

I need to chose the best solution (software and algorithm) to solve my problem. I need to store up to million strings and data associated with them. Strings are search queries and data is cached search result. So the operations I'm going to perform are:

  • get a query from user
  • look for cached result in DB (exact match)
  • if found – return it
  • if not – perform a search and store the result

Some other facts:

  • As it is cache rows will expire over time.
  • Lookup and insert speed is a priority.
  • This string field is most probably the only one to search by.
  • I'm using PHP

Looks like hashing the string is a good idea, collisions are not a problem, if I get it I will just assume it's a cache miss and perform a new search. So I'm thinking of the following options:

  • MySQL with hash stored in binary field
  • MySQL with hash as string with index
  • MongoDB (maybe with capped collection instead of expiration time field)
  • some other solution (specialized key-value storage or something) I don't have experience with if it's good enough and worth learning for the project

I also have used Sphinx search when I needed almost the same thing but with partial match also. For exact search it seems to be an overkill.

So is hashing a good idea at all?
Which hash algorithm would you recommend?
Which option if better and why?
Does the answer change if I need to store 5 million rows?
What if I add hits field to get most popular rows from time to time?

Best Answer

PostgreSQL is perfectly fine for this. You got a couple of options to make this work.

First of all: PostgreSQL has a special index type for that called GIN (http://www.cybertec.at/gin-just-an-index-type/). It is perfect for full-text-search in general.

The cool thing is: In the latest version of PostgreSQL there is support for a thing called jsonb. You can put your string into a JSON document and search on EACH field in the JSON in a nice way using GIN and a couple of cool operators (see http://www.postgresql.org/docs/9.4/static/datatype-json.html). JSONB is really really fast, very powerful and it definitely kills MongoDB on the performance side. In addition to that a couple million of rows are no real big deal for PostgreSQL anyway. There is one more cool thing: PostgreSQL can do fuzzy search on strings (through Nearest-Neighbour search).