MySQL Query Performance – Why Queries Become Extremely Slow

mariadb mysql query-performance

I have a query that usually runs within a few seconds or minutes, but after the server has been up for a while (about a week) it becomes extremely slow and takes days to execute. The query just stays in the 'Sending data' state and CPU usage is at 100%. The server runs MariaDB 10.4 and executes many other complex queries without problems; only this specific query seems to hit either some server limitation or a performance bug.

The amount of data does not seem to be relevant: the query runs on different databases, which are created and deleted for each project and hold different numbers of records, and the problem occurs even for smaller projects.

Restarting the server makes the query run fast again for a while, but the problem keeps coming back. It does not seem to occur before the server reaches its maximum allowed amount of RAM, even though there is still free RAM on the machine (I reduced the buffer size specifically to test this). Once the problem manifests, it happens with both the InnoDB and MyISAM engines.
Since the query runs quite fast after a server restart, missing indexes or the like do not seem to be the cause. Any hints on what can cause this behaviour and how to investigate or solve it?

Here follows the query:

CREATE TABLE counts_otus (
    _sample_id INT,
    _region_sample_id INT,
    sequencesPerOtu INT,
    PRIMARY KEY (_region_sample_id),
    INDEX (_sample_id)
) ENGINE=InnoDB AS 
    SELECT _sample_map._sample_id, _sample_map._region_sample_id, (
            SELECT COUNT(*) 
              FROM cluster AS otu 
             WHERE otu._cluster_sample_id = _sample_map._region_sample_id
    ) + (
            SELECT COUNT(*)
              FROM cluster AS otu 
        INNER JOIN cluster AS mem 
                ON otu._region_sample_id = mem._cluster_sample_id
             WHERE otu._cluster_sample_id = _sample_map._region_sample_id
    ) + 1 AS sequencesPerOtu
      FROM Region
INNER JOIN _sample_map USING (primaryAccession)
INNER JOIN sample USING (_sample_id)
     WHERE regionTag IS NULL
       AND sampleTag IS NULL
       AND sample_type <> 'otumap'
;

The query plans are indeed different, which could be decisive in triggering the problem.
The plan when running fast is:

+------+--------------------+-------------+------+-------------------------------------------------+----------------+---------+------------------------------------------------------------------------+-------+--------------------------+
| id   | select_type        | table       | type | possible_keys                                   | key            | key_len | ref                                                                    | rows  | Extra                    |
+------+--------------------+-------------+------+-------------------------------------------------+----------------+---------+------------------------------------------------------------------------+-------+--------------------------+
|    1 | PRIMARY            | sample      | ALL  | PRIMARY,id_sample_type                          | NULL           | NULL    | NULL                                                                   | 10    | Using where              |
|    1 | PRIMARY            | _sample_map | ref  | fk_sset_seqent,fk_sset_sample,fk_sset_smapleTag | fk_sset_sample | 4       | silvangs_slv_main_pid23875_rid26315.sample._sample_id                  | 52186 | Using where              |
|    1 | PRIMARY            | Region      | ref  | PRIMARY,fk_rgnTag                               | fk_rgnTag      | 100     | const,silvangs_slv_main_pid23875_rid26315._sample_map.primaryAccession | 1     | Using where; Using index |
|    3 | DEPENDENT SUBQUERY | otu         | ref  | PRIMARY,id_cluster                              | id_cluster     | 4       | silvangs_slv_main_pid23875_rid26315._sample_map._region_sample_id      | 1     | Using index              |
|    3 | DEPENDENT SUBQUERY | mem         | ref  | id_cluster                                      | id_cluster     | 4       | silvangs_slv_main_pid23875_rid26315.otu._region_sample_id              | 1     | Using index              |
|    2 | DEPENDENT SUBQUERY | otu         | ref  | id_cluster                                      | id_cluster     | 4       | silvangs_slv_main_pid23875_rid26315._sample_map._region_sample_id      | 1     | Using index              |
+------+--------------------+-------------+------+-------------------------------------------------+----------------+---------+------------------------------------------------------------------------+-------+--------------------------+

The plan when running extremely slowly (I killed the running query and ran EXPLAIN on its SELECT right afterwards):

+------+--------------------+-------------+--------+-------------------------------------------------+----------------+---------+------------------------------------------------------------------------+--------+--------------------------+
| id   | select_type        | table       | type   | possible_keys                                   | key            | key_len | ref                                                                    | rows   | Extra                    |
+------+--------------------+-------------+--------+-------------------------------------------------+----------------+---------+------------------------------------------------------------------------+--------+--------------------------+
|    1 | PRIMARY            | sample      | ALL    | PRIMARY,id_sample_type                          | NULL           | NULL    | NULL                                                                   | 10     | Using where              |
|    1 | PRIMARY            | _sample_map | ref    | fk_sset_seqent,fk_sset_sample,fk_sset_smapleTag | fk_sset_sample | 4       | silvangs_slv_main_pid23875_rid26315.sample._sample_id                  | 41361  | Using where              |
|    1 | PRIMARY            | Region      | ref    | PRIMARY,fk_rgnTag                               | fk_rgnTag      | 100     | const,silvangs_slv_main_pid23875_rid26315._sample_map.primaryAccession | 1      | Using where; Using index |
|    3 | DEPENDENT SUBQUERY | mem         | index  | id_cluster                                      | id_cluster     | 4       | NULL                                                                   | 738041 | Using index              |
|    3 | DEPENDENT SUBQUERY | otu         | eq_ref | PRIMARY,id_cluster                              | PRIMARY        | 4       | silvangs_slv_main_pid23875_rid26315.mem._cluster_sample_id             | 1      | Using where              |
|    2 | DEPENDENT SUBQUERY | otu         | ref    | id_cluster                                      | id_cluster     | 4       | silvangs_slv_main_pid23875_rid26315._sample_map._region_sample_id      | 57226  | Using index              |
+------+--------------------+-------------+--------+-------------------------------------------------+----------------+---------+------------------------------------------------------------------------+--------+--------------------------+

So when the query runs slowly, the plan contains not only "ref" join types but also "index" and "eq_ref", which should be better as far as I can tell, yet the query ends up stuck for days.
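A note on reading the two plans: in EXPLAIN output, the "index" type denotes a scan of the whole index, so in the slow plan `mem` is read in full (an estimated 738041 entries) each time the dependent subquery is evaluated. The row estimate for `id_cluster` on `otu` also changes from 1 to 57226 between the plans. One possible explanation, stated here purely as an assumption and not confirmed anywhere in this thread, is that the index statistics of `cluster` have drifted. A minimal sketch for checking that assumption:

```sql
-- ASSUMPTION: changed index statistics on `cluster` are behind the plan switch.
-- This is a diagnostic sketch, not a confirmed cause.

-- 1. Record the optimizer's current estimates (see the Cardinality column).
SHOW INDEX FROM cluster;

-- 2. Recompute the statistics for the table.
ANALYZE TABLE cluster;

-- 3. Compare the estimates, then run EXPLAIN on the SELECT again
--    to see whether the join order of `otu` and `mem` has changed.
SHOW INDEX FROM cluster;
```

If the Cardinality values change noticeably between steps 1 and 3 and the plan returns to the fast form, that would point towards statistics; if nothing changes, the assumption can be discarded.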

The question was originally posted on Stack Overflow, where it was suggested that it would be better suited here on DBA. The original question is at https://stackoverflow.com/questions/60952661/why-does-a-query-becomes-extremely-slow-independently-from-data-amount

Best Answer

According to the MySQL documentation (https://dev.mysql.com/doc/refman/5.7/en/controlling-query-plan-evaluation.html), a wrong query plan can indeed make a difference of such orders of magnitude as seconds versus days, so I assume the problem lies in the optimizer choosing the wrong plan. Why this regularly happens after the server has been running for some time (and the memory available to its buffers is fully allocated) remains a mystery. However, the solution seems to be to give the optimizer hints that prevent the wrong join order and make it use the index referenced in the good plan. This is achieved by changing the query as follows:

DROP TABLE IF EXISTS test_counts_otus;
CREATE TABLE test_counts_otus (
    _sample_id INT,
    _region_sample_id INT,
    sequencesPerOtu INT,
    PRIMARY KEY (_region_sample_id),
    INDEX (_sample_id)
) ENGINE=InnoDB AS 
    SELECT _sample_map._sample_id, _sample_map._region_sample_id, (
            SELECT COUNT(*) 
              FROM cluster AS otu FORCE INDEX ( id_cluster )
             WHERE otu._cluster_sample_id = _sample_map._region_sample_id
    ) + (
            SELECT COUNT(*)
              FROM cluster AS otu FORCE INDEX ( id_cluster )
     STRAIGHT_JOIN cluster AS mem FORCE INDEX ( id_cluster )
                ON otu._region_sample_id = mem._cluster_sample_id
             WHERE otu._cluster_sample_id = _sample_map._region_sample_id
    ) + 1 AS sequencesPerOtu
      FROM Region
INNER JOIN _sample_map USING (primaryAccession)
INNER JOIN sample USING (_sample_id)
     WHERE regionTag IS NULL
       AND sampleTag IS NULL
       AND sample_type <> 'otumap'
;

The fixed query uses index hints (FORCE INDEX) and STRAIGHT_JOIN, as documented at https://mariadb.com/kb/en/index-hints-how-to-force-query-plans/
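To check that the hints have the intended effect, the plan of the hinted SELECT can be inspected without creating the table. The sketch below simply prefixes the SELECT part of the fixed query with EXPLAIN; it assumes the same schema as in the question and adds nothing else.

```sql
-- Sketch: show the plan of the hinted SELECT only (no table is created).
EXPLAIN
    SELECT _sample_map._sample_id, _sample_map._region_sample_id, (
            SELECT COUNT(*)
              FROM cluster AS otu FORCE INDEX ( id_cluster )
             WHERE otu._cluster_sample_id = _sample_map._region_sample_id
    ) + (
            SELECT COUNT(*)
              FROM cluster AS otu FORCE INDEX ( id_cluster )
     STRAIGHT_JOIN cluster AS mem FORCE INDEX ( id_cluster )
                ON otu._region_sample_id = mem._cluster_sample_id
             WHERE otu._cluster_sample_id = _sample_map._region_sample_id
    ) + 1 AS sequencesPerOtu
      FROM Region
INNER JOIN _sample_map USING (primaryAccession)
INNER JOIN sample USING (_sample_id)
     WHERE regionTag IS NULL
       AND sampleTag IS NULL
       AND sample_type <> 'otumap'
;
```

In the output, the rows of the second dependent subquery should list `otu` before `mem`, both with `id_cluster` in the key column, as in the fast plan from the question. STRAIGHT_JOIN fixes that read order, and FORCE INDEX steers the optimizer away from the full index scan on `mem` seen in the slow plan.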