MySQL – Why Bulk Multi-Column Key Queries Are Slow

(For this question, I am using AWS/Aurora MySQL with a reasonably-spec'd RDS instance)

Consider the following schema:

Table T:
    col0: the usual autoincrement primary key
    col1: varchar
    col2: varchar
    col3: varchar
    col4...N: various data

Consider that there is a unique index on:

<col1, col2, col3>

And a non-unique index on:

<col1, col2>

And consider the following query:

SELECT * FROM T
WHERE
    (col1 = 'val1' AND col2 = 'id1') OR
    (col1 = 'val2' AND col2 = 'id2') OR
    ...
    (col1 = 'valN' AND col2 = 'idN');

I would (perhaps naively) have expected MySQL to figure out that each element of the OR set matches the (non-unique) index, and to perform the query the way it would have if I had said:

WHERE col0 in (v1, v2, ... , vN)

But it doesn't seem to do that: the timing for these two queries is WAY OFF, on the order of 10x slower for the "or of ands" query. EVEN allowing for the secondary key lookup, and the fact that it's a string column lookup, 10x seems a bit severe. Note that EXPLAIN claims to be using the correct/expected index whether I specify (col1, col2) or (col1, col2, col3).

Please note also that:

SELECT * from T
WHERE
    col1 in (list1)
AND
    col2 in (list2);

Is also slow when there are a lot of different values in list1 and list2. Doing an "and" for the three columns is almost intractably slow.

Perhaps not surprisingly, this query works better than the "or of ands" when list1 is of length 1.

Best Answer

With "row constructors", you might get an optimization:

WHERE (col1, col2) IN (('v1', 'id1'), ('v2', 'id2'), ...)

But in old versions, that syntax was accepted yet led to a table scan. I can't say specifically about the version you are running.
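
One way to find out how a given server treats the row-constructor form is to run it through EXPLAIN. The sketch below reuses table T and the column names from the question; the three value pairs are placeholders. MySQL 5.7.3 and later can apply range access to this form, so which plan actually appears depends on the engine version behind the Aurora cluster.

```sql
-- T, col1 and col2 come from the question; the literal values are placeholders.
EXPLAIN
SELECT *
FROM T
WHERE (col1, col2) IN (('val1', 'id1'), ('val2', 'id2'), ('val3', 'id3'));

-- In the output, look at the "type" and "key" columns:
--   type = range, key = an index starting with (col1, col2): index lookups are used
--   type = ALL,   key = NULL: the whole table is being scanned
```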

When you have this pair of indexes:

UNIQUE(col1, col2, col3)  -- (or plain INDEX)
INDEX(col1, col2)

there is no need for the latter, since the former can handle any queries that need it.
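
As a minimal sketch of removing the redundant index: the index name below is hypothetical, since the question does not give one, so look it up first.

```sql
-- List the indexes on T; the Key_name column shows each index's name.
SHOW INDEX FROM T;

-- idx_col1_col2 is a hypothetical name; substitute the Key_name
-- reported above for the plain (col1, col2) index.
ALTER TABLE T DROP INDEX idx_col1_col2;
```

Dropping it also removes one index that every INSERT, UPDATE and DELETE would otherwise have to maintain.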

Perhaps the optimal way to write your query is

WHERE col1 in ('v1', 'v2', ...)
  AND (col1, col2) IN (('v1', 'id1'), ('v2', 'id2'), ...)

With that, it will use any index starting with col1 as a crude filter, then use the other part for the rest of the filtering.
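
Written out in full with three placeholder pairs, the combined form might look like the sketch below. The col1 list has to contain every distinct col1 value that appears in the pairs; a value missing from it would silently drop the matching rows.

```sql
-- T, col1 and col2 come from the question; the literal values are placeholders.
SELECT *
FROM T
WHERE col1 IN ('val1', 'val2', 'val3')      -- crude filter via an index starting with col1
  AND (col1, col2) IN (('val1', 'id1'),     -- exact (col1, col2) pairs wanted
                       ('val2', 'id2'),
                       ('val3', 'id3'));
```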

Re "convert to an in method" -- MySQL started out as a lean and mean database; it did most of what anyone needed and did it reasonably well. That was 90% of the development. We are now into the other 90% of the development -- the "long tail". Quite possibly some list somewhere includes "convert to an in method". If so, it is being prioritized along with the thousands of other rare and obscure optimizations. Feel free to file a 'feature request' at bugs.mysql.com; that is the way to add it to the list, or bump it up in priority.

Related Solutions

MySQL – Store equations/formula in a MySQL db table

You can try to leverage dynamic SQL.

If you need to get a calculated value for a given id:

DELIMITER $$
CREATE PROCEDURE get_value(IN _id INT)
BEGIN
  SET @sql = NULL;

  -- Build the statement for the requested row (_id), not a hard-coded id
  SELECT CONCAT('SELECT ', formula, ' value FROM table1 WHERE id = ', _id)
    INTO @sql
    FROM table1
   WHERE id = _id;

  PREPARE stmt FROM @sql;
  EXECUTE stmt;
  DEALLOCATE PREPARE stmt;
  SET @sql = NULL;
END$$
DELIMITER ;

Note: You can of course use an OUT parameter instead of returning the result set if you want to.
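
A rough sketch of the OUT-parameter variant follows. The procedure name get_value_out, the parameter _value, the user variable @result and the DOUBLE type are all assumptions made for illustration; table1 and its formula column are taken from the procedure above.

```sql
DELIMITER $$
-- get_value_out, _value and @result are hypothetical names; DOUBLE is an assumed type.
CREATE PROCEDURE get_value_out(IN _id INT, OUT _value DOUBLE)
BEGIN
  SET @sql = NULL;
  SET @result = NULL;

  -- Build a statement that stores the computed value in @result
  -- instead of returning it as a result set.
  SELECT CONCAT('SELECT ', formula, ' INTO @result FROM table1 WHERE id = ', _id)
    INTO @sql
    FROM table1
   WHERE id = _id;

  PREPARE stmt FROM @sql;
  EXECUTE stmt;
  DEALLOCATE PREPARE stmt;

  -- Hand the value back through the OUT parameter.
  SET _value = @result;
  SET @sql = NULL;
END$$
DELIMITER ;

-- Usage: the value lands in a user variable rather than a result set.
CALL get_value_out(1, @v);
SELECT @v;
```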

Sample usage:

CALL get_value(1);

Sample output:

|           VALUE |
|-----------------|
| 82.916129032258 |

Here is what a procedure might look like to get all the values calculated by the formulas:

DELIMITER $$
CREATE PROCEDURE get_values()
BEGIN
  SET @sql = NULL;

  SELECT GROUP_CONCAT(CONCAT(
    'SELECT id, ', formula, ' value FROM table1 WHERE id = ', id)
    ORDER BY id SEPARATOR ' UNION ALL ')
    INTO @sql
    FROM table1;

  PREPARE stmt FROM @sql;
  EXECUTE stmt;
  DEALLOCATE PREPARE stmt;
  SET @sql = NULL;
END$$
DELIMITER ;

Sample usage:

CALL get_values();

Sample output:

| ID |           VALUE |
|----|-----------------|
|  1 | 82.916129032258 |
|  2 |    0.0000109375 |

Here is a SQLFiddle demo for both procedures.

SQL Server – Uniquifier on Non-Unique Clustered Index vs. Unique Clustered Index

One of the most important considerations for a clustered index in terms of performance is that it be ever increasing (or decreasing) and not something that will be changed (except possibly very rarely). The clustered index represents the physical order of your table. If you are constantly inserting into the middle of the index, or modifying the values of your clustered index, then you will get bad page splits, where SQL Server has to move part of the data from one page into another in order to make room for the new data. These moves take time and cause fragmentation that degrades performance.

That being said, the size of the clustered index, while important, is not your biggest concern. I did some experiments with using date columns (with a uniquifier) vs. an int column and found that if my queries were date based I still got a big bump in performance.

My suggestion to you is that if ID1, ID2 and ID3 are ever increasing (and rarely change), then use them as your clustered index. If not, and you still want to enforce uniqueness, then make them either a non-clustered primary key or a non-clustered unique key. If they are not ever increasing, then you can consider your date column or just create a surrogate key for the table. Use an int data type and make it an identity column.

If you do create a surrogate key, then you can create a non-clustered index on ID1 to improve the performance of those queries, assuming you didn't already create a unique index for it. If you frequently need to return, say, ID2 and DTM, then you could INCLUDE those in your index to further improve your performance.
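
The surrogate-key option could be sketched in T-SQL roughly as follows. The table name dbo.MyTable, the column SurrogateID, the constraint and index names, and all column types are assumptions for illustration; only ID1, ID2, ID3 and DTM come from the discussion above.

```sql
-- dbo.MyTable, SurrogateID, PK_MyTable and IX_MyTable_ID1 are hypothetical names;
-- the column types are assumed.
CREATE TABLE dbo.MyTable (
    SurrogateID INT IDENTITY(1,1) NOT NULL,  -- ever-increasing surrogate key
    ID1 INT NOT NULL,
    ID2 INT NOT NULL,
    ID3 INT NOT NULL,
    DTM DATETIME2 NOT NULL,
    CONSTRAINT PK_MyTable PRIMARY KEY CLUSTERED (SurrogateID)
);

-- Non-clustered index on ID1 that also carries ID2 and DTM at the leaf level,
-- so queries filtering on ID1 and returning ID2 and DTM can avoid key lookups.
CREATE NONCLUSTERED INDEX IX_MyTable_ID1
    ON dbo.MyTable (ID1)
    INCLUDE (ID2, DTM);
```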
