MySQL group replication – how to find queries causing replication lag

MySQL

How can we determine which queries are causing a set of MySQL servers running Group Replication to have replication lag?

We have a MySQL Group Replication cluster with 9 servers, and we are currently experiencing replication lag (i.e high values in performance_schema.replication_group_member_stats.COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE for several servers). We're using MySQL Community version 8.0.19.

We noticed that neither slow queries nor frequent queries are necessarily correlated with lag across servers. We have also noticed that some specific joins we had in our application were causing lag spikes and, once we removed them, the lag reduced. This is, however, not the case for all joins.

Is there a tool or a methodology we can use to find the queries causing the lag?

Best Answer

Turn on the slowlog on both machines. What version of MySQL are you using? Here are good settings to have for a relatively new version:

log_output = FILE
slow_query_log = ON
slow_query_log_file = (fullpath to some file)
long_query_time = 1
log_slow_admin_statements = ON
log_queries_not_using_indexes = OFF
log_slow_slave_statements
log_slow_extra = ON

More: http://mysql.rjweb.org/doc.php/mysql_analysis#slow_queries_and_slowlog

A possible cause of what you are seeing: An ALTER that takes some time and hits tableA is replicated. Meanwhile, other things are happening (read or write) to tableA. The above settings will probably catch the ALTER.

Another thing to look for (but hard to see) in the slowlog: A bunch of queries ending at about the same second. This usually means that one of them took a long time and was blocking the others (either directly or indirectly).

In any case, use pt-query-digest; its default sort is to put the "worst" queries first. Then...

  • The blocker may be among the first few.
  • The first few are likely to be queries that you should work on improving -- better indexing, reformulation, etc.
  • Sometimes a "worst" query is a very fast query that was blocked by the blockage; that is it was a "victim", not a "villain". Example: UPDATE ... WHERE primary_key = 123 taking second(s) or even minute(s), not milliseconds.