MySQL – Execution plan changed for existing query – high CPU usage

Tags: index, MySQL, mysql-5.7

So a few weeks ago we had an alert on our production MySQL database due to a CPU usage spike. We traced it to a query that had been running for several months and suddenly started misbehaving. The execution plan of that query is:

    +----+-------------+------------+------------+------+----------------------------------------------------------+----------------------------+---------+----------------------------+------+----------+----------------------------------------------+
    | id | select_type | table      | partitions | type | possible_keys                                            | key                        | key_len | ref                        | rows | filtered | Extra                                        |
    +----+-------------+------------+------------+------+----------------------------------------------------------+----------------------------+---------+----------------------------+------+----------+----------------------------------------------+
    |  1 | PRIMARY     | <derived2> | NULL       | ALL  | NULL                                                     | NULL                       | NULL    | NULL                       | 2485 |   100.00 | NULL                                         |
    |  2 | DERIVED     | c          | NULL       | ALL  | PRIMARY                                                  | NULL                       | NULL    | NULL                       | 7420 |    10.00 | Using where; Using temporary; Using filesort |
    |  2 | DERIVED     | app        | NULL       | ref  | applications_status_ix,applications_ibfk_3               | applications_ibfk_3        | 4       | prod_v3.c.id               |  750 |     0.41 | Using where                                  |
    |  2 | DERIVED     | inv        | NULL       | ref  | app_id                                                   | app_id                     | 4       | ref.app.app_id             |    1 |   100.00 | NULL                                         |
    +----+-------------+------------+------------+------+----------------------------------------------------------+----------------------------+---------+----------------------------+------+----------+----------------------------------------------+

What we did was add an index hint to use applications_status_ix; the query performance then went back to normal and the CPU usage went down (the hint itself is shown below the plan):

+----+-------------+------------+------------+--------+-------------------------------+-------------------------------+---------+----------------------------+--------+----------+---------------------------------------------------------------------+
| id | select_type | table      | partitions | type   | possible_keys                 | key                           | key_len | ref                        | rows   | filtered | Extra                                                               |
+----+-------------+------------+------------+--------+-------------------------------+-------------------------------+---------+----------------------------+--------+----------+---------------------------------------------------------------------+
|  1 | PRIMARY     | <derived2> | NULL       | ALL    | NULL                          | NULL                          | NULL    | NULL                       |   5077 |   100.00 | NULL                                                                |
|  2 | DERIVED     | app        | NULL       | range  | applications_status_ix        | applications_status_ix        | 1       | NULL                       | 464405 |    10.00 | Using index condition; Using where; Using temporary; Using filesort |
|  2 | DERIVED     | c          | NULL       | eq_ref | PRIMARY                       | PRIMARY                       | 4       | app.company_id             |      1 |    10.00 | Using where                                                         |
|  2 | DERIVED     | inv        | NULL       | ref    | app_id                        | app_id                        | 4       | app_id                     |      1 |   100.00 | NULL                                                                |
+----+-------------+------------+------------+--------+-------------------------------+-------------------------------+---------+----------------------------+--------+----------+---------------------------------------------------------------------+
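
For reference, the hint is simply attached to the applications table reference, roughly like this (a sketch of the FROM clause only; FORCE INDEX is shown here, USE INDEX goes in the same place):

    SELECT ...
    FROM `applications` AS `app` FORCE INDEX (`applications_status_ix`)
             JOIN `companies` AS `c` ON `app`.`company_id` = `c`.`id`
             LEFT OUTER JOIN `application_invoices` AS `inv` USING (`app_id`)
    WHERE ...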

Basically, the query looks for certain application statuses that appear relatively rarely in the table. That's why using that index is better, even with this cardinality:

+---------------------+------------+---------------------------------+--------------+-------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table               | Non_unique | Key_name                        | Seq_in_index | Column_name       | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------------+------------+---------------------------------+--------------+-------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| applications        |          1 | applications_status_ix          |            1 | status            | A         |           9 |     NULL | NULL   |      | BTREE      |         |               |
| applications        |          1 | applications_ibfk_3             |            1 | company_id        | A         |       16240 |     NULL | NULL   |      | BTREE      |         |               |
+---------------------+------------+---------------------------------+--------------+-------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
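
For context, the figures above are ordinary index statistics as the optimizer sees them; something like the following produced that table (output trimmed to the two competing indexes):

    -- Index statistics for the applications table; the optimizer is choosing
    -- between applications_status_ix and applications_ibfk_3
    SHOW INDEX FROM applications;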

So I have two questions:

  1. Why would the query planner suddenly decide to use a different execution plan for a query that has been in use for a while? I have no hard proof, but I really doubt the cardinality changed materially from what we have now.
  2. Trying to find a solution, we read about histograms (called value distribution statistics in other DBMSs). Unfortunately they are only available in MySQL 8 and we are still on version 5.7. Do you happen to know any other workaround for our MySQL version?

EDIT1:

So the query being executed is

SELECT sum((CASE
                WHEN `alias_50656806`.`status` = 'CAPTURED' THEN `alias_50656806`.`amount`
                ELSE `alias_50656806`.`due_by` END - `alias_50656806`.`recovered_amount`))
FROM (SELECT `app`.`app_id`,
             `app`.`status`,
             `app`.`amount`,
             `app`.`purchases`,
             `app`.`due_by`,
             sum(CASE
                     WHEN (`inv`.`paid_at` IS NULL AND DATEDIFF(current_date(), `inv`.`created_at`) < 10) THEN 0
                     ELSE IFNULL(`inv`.`amount_due`, 0) END) AS `recovered_amount`
      FROM `applications` AS `app`
               JOIN `companies` AS `c` ON `app`.`company_id` = `c`.`id`
               LEFT OUTER JOIN `application_invoices` AS `inv` USING (`app_id`)
      WHERE (NOT (`app`.`repurchases`) AND
             `app`.`status` IN ('CAPTURED', 'LOCKED', 'ERROR') AND `c`.`is_test` = FALSE AND 1 = 1)
      GROUP BY `app`.`credit_app_id`, `app`.`status`, `app`.`credit_amount`, `app`.`cashless_repurchases`,
               `app`.`due_by`) AS `alias_50656806`;

And the table sizes are:

  • applications has 12701431 records.
  • companies has 7500 records.
  • the status distribution is:
STATUS          COUNT (*)
RECOVER_FUNDS   46400
ERROR           18792
LOCKED          3
CAPTURED        151854
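
(The distribution above comes from a plain GROUP BY count along these lines; the exact status list filtered on is an assumption:)

    SELECT status, COUNT(*)
    FROM applications
    WHERE status IN ('CAPTURED', 'LOCKED', 'ERROR', 'RECOVER_FUNDS')
    GROUP BY status;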

And the applications table DDL is:

CREATE TABLE `applications` (
  `app_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `created_at` timestamp(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3),
  `updated_at` timestamp(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3) ON UPDATE CURRENT_TIMESTAMP(3),
  `status` enum('PREAPPROVED','INCOMPLETE','DENIED','PENDING_DATA','PENDING_CC','FAILED_DATA','ERROR','READY','CAPTURED','EXPIRED','SETTLED','LOCKED','OPTED_OUT','RECOVER_FUNDS','EXCLUDED','ABANDONED','BAD_OPT') COLLATE utf8_bin NOT NULL,
  `account_code` enum('REF','NO_REF') COLLATE utf8_bin NOT NULL,
  `amount` int(10) unsigned NOT NULL,
  `customer_id` int(11) unsigned NOT NULL,
  `company_id` int(11) unsigned NOT NULL,
  `flow_id` int(11) unsigned NOT NULL,
  `app_ext_id` varchar(20) COLLATE utf8_bin NOT NULL,
  `cashless_repurchases` tinyint(1) NOT NULL DEFAULT '0',
  `random_no` float unsigned NOT NULL DEFAULT '0.5',
  `breakage_amount` int(11) DEFAULT NULL,
  `first_repurch` timestamp(3) NULL DEFAULT NULL,
  `last_repurch` timestamp(3) NULL DEFAULT NULL,
  `repurch_count` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `total_repurch_amount` int(10) unsigned NOT NULL DEFAULT '0',
  `due_by` int(10) unsigned DEFAULT NULL,
  `due_by_customer` int(10) unsigned DEFAULT NULL,
  `last_modified_by` varchar(64) COLLATE utf8_bin DEFAULT NULL,
  `app_user_id` int(11) unsigned DEFAULT NULL,
  PRIMARY KEY (`app_id`),
  UNIQUE KEY `flow_id` (`flow_id`),
  UNIQUE KEY `app_ext_id` (`app_ext_id`),
  KEY `applications_customer_ix` (`customer_id`),
  KEY `applications_status_ix` (`status`),
  KEY `applications_ibfk_3` (`company_id`),
  CONSTRAINT `applications_ibfk_1` FOREIGN KEY (`flow_id`) REFERENCES `flow_entries` (`id`),
  CONSTRAINT `applications_ibfk_2` FOREIGN KEY (`customer_id`) REFERENCES `customers` (`id`),
  CONSTRAINT `applications_ibfk_3` FOREIGN KEY (`company_id`) REFERENCES `companies` (`id`),
  CONSTRAINT `applications_ibfk_4` FOREIGN KEY (`status`) REFERENCES `application_statuses` (`status`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=12709869 DEFAULT CHARSET=utf8 COLLATE=utf8_bin

Best Answer

The values in the filtered column of both execution plans look pretty far off target. Maybe it would help to disable the condition_fanout_filter optimizer switch in the session executing these multi-join queries (see the documentation on condition filtering).
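
For example, limited to the problematic session (this switch already exists in 5.7, so no upgrade is needed):

    -- Disable condition filtering for this session only, then re-run the query
    -- and compare the plan / CPU usage
    SET SESSION optimizer_switch = 'condition_fanout_filter=off';

    -- Restore the default afterwards
    SET SESSION optimizer_switch = 'condition_fanout_filter=on';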

Why would the query planner suddenly decide to use a different execution plan for a query that has been in use for a while? I have no hard proof, but I really doubt the cardinality changed ...

It probably takes the growing size of the applications table into account even if the number of records to select stays mostly constant. Maybe an optimizer switch that makes it prefer larger tables as the leading table would help reduce the probability of similar failures in other queries (somewhat like Oracle's DB_FILE_MULTIBLOCK_READ_COUNT), but I don't see any such feature in MySQL.

Do you happen to know any other workaround

The query optimizer is somehow able to estimate the number of selected rows in applications: 464405. But that is at least twice the actual count from the "table sizes" section. Maybe just running ANALYZE TABLE applications would help reduce the estimate and get a better execution plan without hints?
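
That is, something as simple as:

    -- Recompute the table's persistent statistics, then check whether the
    -- optimizer picks the good plan again without the index hint
    ANALYZE TABLE applications;
    EXPLAIN SELECT ... ;  -- the original, un-hinted query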