Mysql – Inconsistent performance of WHERE and ORDER BY depending on size of WHERE match

MySQL

Been working with MySQL for 15 years, and never seen anything quite like this before.

MySQL table with 4 million records.

I have found a very strange "sour spot" (opposite of "sweet spot") in performance, depending on how many rows are matched by a WHERE clause in conjunction with an ORDER BY clause.

If the number of rows is small, great performance. If the number of rows is huge, again great performance. But if the number of rows matched by WHERE is in the middle, around 8000 rows, suddenly, extremely poor performance.

Here's the query that I first noticed being slow:

SELECT name FROM lineageServer_lives
    WHERE  name LIKE 'Eve A%' ORDER BY generation DESC LIMIT 5;

There are 8978 rows that match this WHERE clause.

It ran for 50 seconds before I killed it. Then consider this query:

SELECT name FROM lineageServer_lives
    WHERE  name LIKE 'Eve Aa%' ORDER BY generation DESC LIMIT 5;

That WHERE clause matches fewer rows, only 1400, so we might expect performance to be better. Indeed, it is. That query completes in 0.06444050 seconds. That's a huge difference, though. Even if some sort was happening in the ORDER BY clause that was O(N^2), we wouldn't expect that big of a difference. Still, it's not that strange for the smaller one to be faster….

HOWEVER, here's the really weird part. Consider this query:

SELECT name FROM lineageServer_lives
    WHERE  name LIKE 'Eve%' ORDER BY generation DESC LIMIT 5;

That WHERE clause matches 154,119 rows. But performance is fine, completing in 0.15081400 seconds.

I'm wondering if MySQL is optimizing these queries in a strange way.

Does it order the rows first, and then walk through them looking for a WHERE match? In that case, a big WHERE match set would be better, because after ordering, it would find its 5 matching hits very quickly. On the other hand, if there were few WHERE matches, ordering the rows first would be slow.

If it finds matching rows with WHERE first, and then orders them after that, this would be fast on small WHERE matches, but slow on huge WHERE matches.

The fact that it's fast on huge WHERE matches and small WHERE matches, but slow on medium-size WHERE matches doesn't make sense, unless MYSQL is analyzing the query, picking the right approach in the extreme cases, but picking the wrong approach in the middle, on the boundary between the two cases.

By the way, there are currently indexes on the "names" and "generation" columns. I also tried a two-column INDEX( names, generation ), but that didn't help. I also tried INDEX( generation, names ), as a sanity check.

Here's EXPLAIN output for all three. Slow one, followed by two fast ones. You can see that MySQL is taking the same approach for the first two, and a different approach for the last one.

mysql> explain SELECT name FROM lineageServer_lives WHERE  name LIKE 'Eve A%' ORDER BY generation DESC LIMIT 5;
+----+-------------+---------------------+------------+-------+---------------+------------+---------+------+------+----------+-------------+
| id | select_type | table               | partitions | type  | possible_keys | key        | key_len | ref  | rows | filtered | Extra       |
+----+-------------+---------------------+------------+-------+---------------+------------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | lineageServer_lives | NULL       | index | name          | generation | 9       | NULL | 1343 |     0.37 | Using where |
+----+-------------+---------------------+------------+-------+---------------+------------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

mysql> explain SELECT name FROM lineageServer_lives WHERE  name LIKE 'Eve%' ORDER BY generation DESC LIMIT 5;
+----+-------------+---------------------+------------+-------+---------------+------------+---------+------+------+----------+-------------+
| id | select_type | table               | partitions | type  | possible_keys | key        | key_len | ref  | rows | filtered | Extra       |
+----+-------------+---------------------+------------+-------+---------------+------------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | lineageServer_lives | NULL       | index | name          | generation | 9       | NULL |   71 |     7.02 | Using where |
+----+-------------+---------------------+------------+-------+---------------+------------+---------+------+------+----------+-------------+
1 row in set, 1 warning (1.65 sec)

mysql> explain SELECT name FROM lineageServer_lives WHERE  name LIKE 'Eve Aa%' ORDER BY generation DESC LIMIT 5;
+----+-------------+---------------------+------------+-------+---------------+------+---------+------+------+----------+---------------------------------------+
| id | select_type | table               | partitions | type  | possible_keys | key  | key_len | ref  | rows | filtered | Extra                                 |
+----+-------------+---------------------+------------+-------+---------------+------+---------+------+------+----------+---------------------------------------+
|  1 | SIMPLE      | lineageServer_lives | NULL       | range | name          | name | 256     | NULL | 1400 |   100.00 | Using index condition; Using filesort |
+----+-------------+---------------------+------------+-------+---------------+------+---------+------+------+----------+---------------------------------------+
1 row in set, 1 warning (0.00 sec)

Also note the differences in the filter column.

Thus, I believe that for the last one, MySQL is testing the WHERE first using the "name" index, seeing only 1400 results, and then saying, "Sweet, not that many, now we can filesort to get the order by and limit." In the previous two, it sees too many result from the WHERE clause to filesort them, and decides instead to walk through the rows in ORDER BY order first, and look for the first 5 WHERE matches that it can find.

Actually, if that's true, then it should get faster if I ask for fewer matches… Hmm… not faster, took 96 seconds:

96.20525475 | SELECT name FROM lineageServer_lives
    WHERE  name LIKE 'Eve A%' ORDER BY generation DESC LIMIT 1

The problem here is that the data isn't randomly distributed. For names that start with "Eve A", they have low generation values. Thus, to sort by generation first and then walk through looking for "Eve A" top-down, we have to walk through all of the 4M records. By changing the order of generation, and filtering out a default value of -1 that is very common in the table, we get better results (0.00262775 seconds):

SELECT name FROM lineageServer_lives
    WHERE  name LIKE 'Eve A%' and generation > 0 ORDER BY generation ASC LIMIT 1;

Of course, these are in the wrong order, and not the results I want. But it helps to explain what's going on (these "Eve A" matches are at the end of the list when sorting the other way, so they're very slow to find).

If I force the index on the "name" column, performance improves:

SELECT name FROM lineageServer_lives force index( name )
    WHERE  name LIKE 'Eve%' ORDER BY generation DESC LIMIT 5

This runs in 0.2 seconds. MySQL finds all matching rows based on the name column and then filesorts the results based on the generation column. However, performance degrades as the WHERE clause hits more and more rows, so it's not a general-purpose solution.

Now that I understand what's going on here, I will close this question. Hopefully, someone else will benefit from this analysis in the future.

Best Answer

Since the query mentions only name and generation, either of these would be "covering", hence potentially better (maybe a little, maybe a not) than single-column indexes:

INDEX(name, generation)
INDEX(generation, name)

Other than that, the Optimizer is caught between a rock and a hard place.

  • If it picks (name, ...) but there are more matching names than estimated, then it is slowed by the temp table, it may be the wrong choice.

  • If it picks (generation, ...) but the desired rows are thousands of rows into the scan, then it may be the wrong choice. Changing the LIMIT value could change how many rows need to be scanned.

The way the Optimizer does the analysis changed to "cost based", which is often better than the old algorithms, but still not good enough to make the 'right' answer in all cases.

In a sense, this is a "2-dimensional index" problem. There are no 2D indexes except (sort of) SPATIAL. But that seems impractical here. PARTITION BY RANGE(name) sort of gives a dimension, then an index starting with generation could be more efficient since it only needs to look in a single partition. This is one of the few use cases for PARTITIONing.

FORCE INDEX may work today, but mess up tomorrow. And your slight changes exhibit that.

MariaDB (ver 10.0+) has "histograms" for its indexes, and might pick the better query plan more often.

ICP (Index Condition Pushdown; EXPLAIN: "Using index condition") was introduced in 5.6.