MySQL – How expensive are SELECT ... LIKE statements from a design perspective

Tags: mysql, performance, query-performance, select

I have a situation that is running out of control: an ever-increasing number of small static files that I need to access randomly in a large directory structure. I MUST reduce the number of those files drastically, very soon. I am investigating quick solutions to relieve the pressure.

One option is to move the files' content (UTF-8 text) into a database and perform SELECTs to replace the file search (by name). The SELECT statements would be the following:

SELECT MyContent FROM MyTable WHERE MyContentName LIKE 'criteria%' LIMIT 1;
SELECT MyContent FROM MyTable WHERE MyContentName LIKE '%othercriteria' LIMIT 1;
SELECT MyContent FROM MyTable WHERE MyContentName LIKE '%andanothercriteria%' LIMIT 1;

We are talking about a total of 200K requests per day against an 800K-row table (which I could easily split in two if that helps). MyContentName is part of the key and will be indexed. There is either exactly one entry matching the criteria in the table or none.

I am no DBA expert. Is this something a MySQL instance on a shared server can support, or are my expectations too high?

I know the typical answer is: I should test. Unfortunately, because of the emergency, I don't have time to perform tests. I need a quick solution, even if it is temporary, and even if it degrades response times a little.

I am looking for experienced DBAs' opinions on this strategy. Tips and suggestions are welcome too.

Best Answer

If you are not able to fine-tune the filesystem (for instance, by using a smaller block size), and you really must use a database, I'd suggest the following readings:

The first will explain the data structure most commonly used by indexes, the B-Tree. The second explains how MySQL uses the B-Tree. The third will tell you about the command EXPLAIN SELECT ..., which is how MySQL describes the query plan: it will tell you which index (if any) it is using and whether it is doing table scans, which you must avoid at all costs.
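
For instance, a minimal sketch using the table and column names from the question (the exact output format depends on your MySQL version):

EXPLAIN SELECT MyContent FROM MyTable WHERE MyContentName LIKE 'criteria%' LIMIT 1;
-- In the output, look at "type" (ideally range or ref, not ALL),
-- "key" (the index actually chosen) and "rows" (the estimated number of rows examined).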

To create an optimized index, you should first think about the structure of the query (or queries) you will need. For example, it might be something like: SELECT content FROM files WHERE firstParameter = XXX AND secondParameter LIKE 'xxx%'.

You should analyze the cardinality of each column (i.e., how many different values the column can have).

You choose the columns with the highest cardinality to be first in the index, leaving those with lower cardinality at the end. Example: suppose you have 2M rows, firstParameter is a number between 1 and 1M, randomly distributed, and secondParameter is the full name of the owner of the file. In such a situation, you want the index (firstParameter, secondParameter), in this very order, since the clause firstParameter = XXX will leave you, on average, with only 2 rows. The cardinality of secondParameter, on the other hand, is much lower: there are far fewer than 1M possibilities for people's names. Therefore, if your index is (secondParameter, firstParameter), the query WHERE firstParameter = 1 AND secondParameter LIKE 'bruno%' will first look for every row in which secondParameter begins with bruno (which will likely be tens or hundreds of thousands of rows), and only then check the other condition.
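
A minimal sketch of that index, using the hypothetical files table and columns from the example above:

-- Higher-cardinality column first, so the equality on firstParameter narrows the
-- search before the name prefix is scanned:
CREATE INDEX idx_first_second ON files (firstParameter, secondParameter);
SELECT content FROM files WHERE firstParameter = 1 AND secondParameter LIKE 'bruno%';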

Also, note that the index is used from left to right. That is, if you have 3 columns, A, B and C, and you index (A, B, C), the index will be mostly useless in a query such as WHERE A = 1 AND C = 2. It will likely be used to find rows matching A = 1, but after that every such row will be checked for C = 2. If most of your queries are like that one (and some might specify B as well), then your index should be (A, C, B).
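
To illustrate that leftmost-prefix rule (the table and column names here are placeholders):

-- With (A, B, C), only A = 1 is resolved through the index; C = 2 is checked row by row.
CREATE INDEX idx_abc ON t (A, B, C);
-- With (A, C, B), both A = 1 and C = 2 are resolved through the index.
CREATE INDEX idx_acb ON t (A, C, B);
SELECT * FROM t WHERE A = 1 AND C = 2;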

Finally, note that LIKE 'xxx%' can use an index, while LIKE '%xxx' (or LIKE '%xxx%') cannot. This is, again, because the index is read from left to right: to match xxx%, it knows where to start looking; to match %xxx, it must check every single row.
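
Applied to the queries in the question, assuming an index on MyContentName:

-- Can use the index (range scan on the prefix):
SELECT MyContent FROM MyTable WHERE MyContentName LIKE 'criteria%' LIMIT 1;
-- Cannot use the index: the leading wildcard forces a scan of every value.
SELECT MyContent FROM MyTable WHERE MyContentName LIKE '%othercriteria' LIMIT 1;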

All that said about indexes, I'd strongly suggest that you rework your criteria so that you have something more structured. As you said, you could try to precompute something.
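
One possible shape for that, purely as a sketch (the MyLookupKey column is invented here): precompute the lookup key when you load the data, store it in its own indexed column, and turn every lookup into an exact match.

-- Hypothetical precomputed key column, filled in by your loading code:
ALTER TABLE MyTable ADD COLUMN MyLookupKey VARCHAR(255) NOT NULL;
CREATE INDEX idx_lookup ON MyTable (MyLookupKey);
SELECT MyContent FROM MyTable WHERE MyLookupKey = 'andanothercriteria' LIMIT 1;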

There are other considerations, such as the size of the content. If you can make it fit in under 8KB (roughly 3,000 characters in UTF-8), InnoDB will store the data in the same page as the primary key; otherwise, it will store the data elsewhere. If you query by primary key and the data fits in the page, that is a single read operation; if you query by another index and the data is stored elsewhere, that is 3 read operations: one to find the primary key of the matching row, one to find the row by primary key (and read the address of the data), and one to read the data itself.
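
To check whether your content fits, a quick measurement such as this can help (LENGTH counts bytes, which is what matters here, while CHAR_LENGTH counts characters):

SELECT MAX(LENGTH(MyContent)) AS max_bytes, AVG(LENGTH(MyContent)) AS avg_bytes FROM MyTable;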

Oh, and check the amount of RAM on your server. Ideally your data (or at the very least your indexes) should fit in RAM.
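
A rough way to compare, as a sketch (on a shared server you may not be able to change these settings, but you can usually read them; replace your_database with your schema name):

-- Memory InnoDB uses to cache data and indexes:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- Approximate data and index size per table:
SELECT table_name, data_length, index_length FROM information_schema.tables WHERE table_schema = 'your_database';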

By considering all these points, you should have no problem at all. I don't know the hardware of your server or its load (since you said it is shared), but 800k rows is close to nothing if you fine-tune your indexes. I'm far from an expert and, by doing all of the above, I work daily with (very optimized) tables of 10M and some of 100M rows, and the queries are ultra-fast.

I hope that helps. Once you have your table(s), you could ask another question showing the CREATE TABLE statement, describing your data a bit (sizes, cardinalities, etc.) and the SELECT queries you will use, so someone can help you create an optimized index.