Optimize MySQL average query for millions of rows

mysql, mysql-8.0

I have 4 tables (I've listed the main columns for each):

  1. monitors (id)
  2. monitor_node (monitor_id, node_id, average_response_time)
  3. monitor_checks (id, monitor_id, node_id, response_time)
  4. nodes (id)

Their relations (a rough schema sketch follows the list):

1 monitor – N checks

N monitor – N nodes

1 check – 1 monitor & 1 node
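Roughly, the schema looks like this. Column types and keys are simplified here; the real definitions are in the DB Fiddle linked further down.

create table monitors (
  id int unsigned not null auto_increment,
  primary key (id)
);

create table nodes (
  id int unsigned not null auto_increment,
  primary key (id)
);

-- one row per monitor/node pair, holding the precomputed average
create table monitor_node (
  monitor_id int unsigned not null,
  node_id    int unsigned not null,
  average_response_time double,
  primary key (monitor_id, node_id)
);

-- one row per individual check
create table monitor_checks (
  id bigint unsigned not null auto_increment,
  monitor_id    int unsigned not null,
  node_id       int unsigned not null,
  response_time double,
  primary key (id)
);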


So I'm logging a lot of checks for each monitor. monitor_checks will have millions of rows (up to roughly 500 million).

When I insert another batch of checks (~1k rows), I need to recalculate the average response time per monitor/node pair (table monitor_node, column average_response_time).

I'm pretty sure I'm doing this the wrong way and need a faster solution. What I do now: after inserting 1k rows into monitor_checks, I calculate the average response time for each monitor (grouping by monitor_id). Based on that I build an array and run insert ... on duplicate key update against monitor_node to perform a bulk update. Besides the average response time I calculate a few other attributes, which go into the same bulk update.

Making the array with info and executing insert ... on duplicate key update is fast enough.
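For example, the upsert ends up looking roughly like this (the values shown are placeholders; the real statement is generated from the computed averages and carries the extra attributes as well):

-- bulk upsert of precomputed averages into monitor_node
insert into monitor_node (monitor_id, node_id, average_response_time)
values
  (1, 2, 0.25),
  (2, 2, 0.65)
on duplicate key update
  average_response_time = values(average_response_time);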

The slow query is:

select monitor_id, avg(response_time) as avg_response_time 
from `monitor_checks` 
where `node_id` = 2 
group by `monitor_id`

which takes roughly 10–20 seconds on a table of about 4 million rows.

I also realised that the average doesn't need to be based on all checks; the last 50–100 rows per monitor would be enough. But I couldn't figure out how to do this efficiently. I found a fancy solution for grouping checks per monitor and limiting each group to a number of rows (How to get latest 2 records of each group), but it took far too long as well.
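For illustration, a query along these lines (MySQL 8.0 window functions) expresses "average of the latest 100 checks per monitor", but it still has to read and sort every matching row before discarding most of them, so it stays slow at this scale:

-- sketch: average of the 100 most recent checks per monitor for one node,
-- assuming a higher id means a more recent check
select monitor_id, avg(response_time) as avg_response_time
from (
  select monitor_id,
         response_time,
         row_number() over (partition by monitor_id order by id desc) as rn
  from monitor_checks
  where node_id = 2
) as latest
where rn <= 100
group by monitor_id;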

So the question is: how can I quickly recalculate the average response times in monitor_node when inserting ~1k rows into the monitor_checks table?

DB Fiddle: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=bd95afc030361bf1d87f8bc5c3935c2f

Final desired result:

monitor_id  node_id  average_response_time
1           1        0.30
1           2        0.25
2           1        0.55
2           2        0.65

Best Answer

I would add an array column to monitor_node containing the most recent 100 response times. When a new response time is added, drop the oldest entries so the array never exceeds 100 elements. That way no extra data has to be read when a new average has to be calculated.
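For instance, the extra column could be added like this (the name rec_resp_times matches the trigger below; the exact definition is up to you). Initializing it to an empty array matters, because the JSON functions in the trigger return NULL when fed a NULL value:

-- JSON array of the most recent response times per monitor/node pair
alter table monitor_node
  add column rec_resp_times json;

update monitor_node
   set rec_resp_times = json_array()
 where rec_resp_times is null;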

As an example may say more than a thousand words, I've updated the DB Fiddle. A trigger does the job here, but that logic could just as well be incorporated in the job that does the inserts.

create trigger monitor_checks_air
  after insert on monitor_checks
  for each row
begin
  -- after-insert row trigger: keeps the rolling data in monitor_node current
  update monitor_node mn
     set -- prepend the new response time, then keep only the first 100 entries
         mn.rec_resp_times =
           json_extract(
             json_array_insert(mn.rec_resp_times, '$[0]', new.response_time),
             '$[0 to 99]'),
         -- recompute the average from rec_resp_times (assignments in a
         -- single-table update are evaluated left to right, so this sees
         -- the array updated just above)
         mn.average_response_time = (
             select sum(jt.rt)
               from json_table(mn.rec_resp_times,
                               '$[*]' columns (rt double path '$[0]')) as jt
           ) / json_length(mn.rec_resp_times)
   where mn.monitor_id = new.monitor_id
     and mn.node_id    = new.node_id;
end;
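After that, inserting a check keeps the average current automatically, and reads never touch monitor_checks (the values below are made up):

-- the trigger maintains monitor_node as a side effect of this insert
insert into monitor_checks (monitor_id, node_id, response_time)
values (1, 2, 0.31);

-- the rolling average for that monitor/node pair is already up to date
select monitor_id, node_id, average_response_time
from monitor_node
where monitor_id = 1 and node_id = 2;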

The fastest query is the one you don't execute.