MySQL Performance – Improve Query Performance on Large Table with Composite Index


I have a MySQL table with ~30 million rows and 40 columns. A highly active query (5 queries/second) against it is quite slow (avg. ~20 seconds) and scans a large number of rows (avg. 50,000). Performance keeps getting worse as the table grows. I want to solve the problem by adding the correct composite, or even covering, index.

The Doctrine query is built by a dynamic query builder and involves the following columns (only user_id is used in every query; all other columns are only sometimes used for filtering):

  • Always: user_id int with = [> 1 m users, but single users may have > 200K entries]
  • Sometimes: status varchar(20) with IN() [7 possibilities]
  • Sometimes: expiration_timestamp datetime with < [can be any timestamp]
  • Sometimes: type varchar(20) with IN() [7 possibilities]
  • Rare: name varchar(255) with LIKE [with trailing wildcard, rarely repetitive]
  • Very rare: tags varchar(2000) with LIKE [with leading and trailing wildcard]
  • Often: orderBy id int DESC [id is the primary key, the orderBy is necessary]

Without having tested it (will require a production deployment with maintenance window including short downtime) I would propose the following solution:

CREATE INDEX listing ON items(user_id,status,type,name,expiration_timestamp,id);

Here is my reasoning: First of all, the user_id is always used with an equality comparison, so this should be first. status and type have an IN clause, therefore they should be second. The third one is name, because even if LIKE with trailing wildcard is used it is highly selective. Indexing the expiration_timestamp will help to significantly reduce the number of results. As MySQL uses indices for ordering it makes sense to put the id at the end of the composite index. There is no reason to put tags in the index, because an index on a LIKE with a leading wildcard is useless.

Is this the correct approach or would you recommend to improve something here?

One more thing I am not sure about: if a query omits e.g. type or status, will MySQL be "intelligent" enough to use my composite index anyway? I am still quite new to MySQL indexing; thank you for your help!

Best Answer

There are several problems with such an index.

All the conditions you describe other than user_id = ? are considered range conditions. A range condition is any condition that can match more than one value. So <, IN(), and LIKE are all range conditions.

Here's the first problem: in a composite index, only one column involved in a range condition will be used.

Example: Suppose you had an index on (a,b,c) in a hypothetical table.

SELECT ... WHERE a = 1 AND b < 10 AND c IN (1,2,3)

This would use only the (a,b) columns of the index. After the first column used in a range condition, conditions for the subsequent column(s) of the index would need to be evaluated "the hard way," row by row.
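One way to check how many index columns a query really uses is to look at key_len in the EXPLAIN output. The following is a minimal sketch, assuming a hypothetical table t with INT NOT NULL columns; the table, column, and index names are illustrative and not from the question. On an empty or tiny table the optimizer may choose differently, so run it against representative data.

```sql
-- Hypothetical table for illustration only.
CREATE TABLE t (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  a  INT NOT NULL,
  b  INT NOT NULL,
  c  INT NOT NULL,
  KEY idx_abc (a, b, c)
);

-- Equality on a, range on b, IN() on c.
-- Each INT NOT NULL column contributes 4 bytes to key_len, so comparing
-- key_len with 4, 8, or 12 indicates how many leading columns are used
-- for the index lookup itself.
EXPLAIN
SELECT id
FROM t
WHERE a = 1 AND b < 10 AND c IN (1, 2, 3);
```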

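Actually, there's one mitigation for that, which is index condition pushdown (ICP, available since MySQL 5.6). This happens automatically: the storage engine checks the remaining indexed conditions against the index entry before it reads the full row. But this isn't as good as a real index lookup.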
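When index condition pushdown applies, EXPLAIN reports "Using index condition" in the Extra column. A rough way to see its effect is to toggle the index_condition_pushdown flag of optimizer_switch for the current session and compare plans. This is a sketch under assumptions: the table icp_demo and its columns are hypothetical, and the query selects a non-indexed column on purpose, since pushdown is not used when the index already covers the query.

```sql
-- Hypothetical table for illustration only.
CREATE TABLE icp_demo (
  id      INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  a       INT NOT NULL,
  b       INT NOT NULL,
  c       INT NOT NULL,
  payload VARCHAR(100),
  KEY idx_abc (a, b, c)
);

-- With pushdown enabled (the default), look for "Using index condition" in Extra.
SET SESSION optimizer_switch = 'index_condition_pushdown=on';
EXPLAIN
SELECT payload FROM icp_demo WHERE a = 1 AND b < 10 AND c IN (1, 2, 3);

-- With pushdown disabled, the remaining conditions are checked after the row is read.
SET SESSION optimizer_switch = 'index_condition_pushdown=off';
EXPLAIN
SELECT payload FROM icp_demo WHERE a = 1 AND b < 10 AND c IN (1, 2, 3);

-- Restore the default setting for this session.
SET SESSION optimizer_switch = 'index_condition_pushdown=default';
```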

A second problem is that the columns used from the index must be consecutive, starting from the leftmost. If a query "skips" a column, the optimizer can't use any index column after the gap. For example:

SELECT ... WHERE a = 1 AND c IN (1,2,3)

I said you can have one range column in addition to the columns used for equality, which this example query appears to satisfy. But if the index is on columns (a,b,c) and the query has no condition on b, then the condition on column c can't use the index either.
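The skipped-column case can be inspected the same way. This is a sketch assuming a hypothetical table skip_demo (names are illustrative); the exact plan depends on the MySQL version and on the data, so treat the output as something to verify rather than a guarantee.

```sql
-- Hypothetical table for illustration only.
CREATE TABLE skip_demo (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  a  INT NOT NULL,
  b  INT NOT NULL,
  c  INT NOT NULL,
  KEY idx_abc (a, b, c)
);

-- No condition on b: check key_len to see whether only column a
-- (4 bytes for INT NOT NULL) is used for the lookup.
EXPLAIN
SELECT id
FROM skip_demo
WHERE a = 1 AND c IN (1, 2, 3);
```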

A third problem is that ORDER BY optimization is also spoiled by having any range condition in the query. That is, once the query does a range condition, the sort order isn't implicit from the index order.

So the bottom line is that given your dynamic queries, with a mix of different conditions which may or may not be included in a given run, you can't make a single composite index that will satisfy all cases.

What you could do is create several composite indexes:

(user_id, status)
(user_id, expiration_timestamp)
(user_id, type) 
(user_id, name)

Then let the optimizer pick the one that is most relevant to a given query, based on the dynamic conditions included.
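As statements against the items table from the question, this could look like the following sketch. The index names are made up for illustration, and the literal values in the EXPLAIN query are placeholders.

```sql
-- Index names are hypothetical; columns are the ones listed in the question.
CREATE INDEX idx_user_status     ON items (user_id, status);
CREATE INDEX idx_user_expiration ON items (user_id, expiration_timestamp);
CREATE INDEX idx_user_type       ON items (user_id, type);
-- Depending on character set and row format, name VARCHAR(255) may exceed
-- the index key length limit and need a prefix length, e.g. name(100).
CREATE INDEX idx_user_name       ON items (user_id, name);

-- Check which index the optimizer picks for one combination of filters.
-- The "key" column of the EXPLAIN output shows the chosen index.
EXPLAIN
SELECT id
FROM items
WHERE user_id = 42
  AND status IN ('active', 'pending');
```

Each additional index adds write overhead on a 30-million-row table, so it is worth confirming with EXPLAIN that every one of them is actually chosen for some query shape.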

But regardless, ORDER BY id will require a filesort.
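Whether a given query needs a filesort can be checked in the Extra column of EXPLAIN, where it appears as "Using filesort". A minimal sketch against the items table from the question, with placeholder values for the filters:

```sql
-- Placeholder filter values; columns are the ones listed in the question.
-- If the sort cannot be satisfied from index order, Extra shows "Using filesort".
EXPLAIN
SELECT id
FROM items
WHERE user_id = 42
  AND status IN ('active', 'pending')
ORDER BY id DESC;
```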