Mysql – I have multiple sources INSERTing into a MySQL innodb table. My periodic aggregation script never makes accurate aggregates. Why

aggregateMySQL

I apologize in advance if this is a repeat. I'm not really sure how to properly ask for what I'm running into.

I have a large InnoDB table set up. I have 3 sources that all INSERT concurrently at a moderately high volume (300-500 INSERT/s).

I have a PERL script running on a cron job every 5 minutes that aggregates data from the last time it ran. It keeps track of the last record it processed by storing the auto_increment value of the last row in a metadata table.

The aggregates are always off. But not by much. This has been so frustrating because it's just plain simple math (SELECT account_id,sum(sold) GROUP BY account_id). I have a suspicion that it has something to do with the transaction isolation (repeatable-read).

I recently found FlexViews which looks very cool and might address my problem. But I was wondering if anyone could:

  • Tell me what is going on here. Why is this happening?
  • What can be done to produce 100% accurate aggregates

I'm like 95% sure the auto_increment thing along with transaction commit ordering is hosing me up, but I don't understand enough about DBs to really ask the right question.

Oh, one thing to note, I've already checked over the field types. This issues isn't the result of rounding.

Best Answer

Using REPEATABLE-READ means that your SELECT transaction gets to view a "snapshot" of the state of data as of the moment the transaction started. Even if other sessions subsequently insert more data, the SELECT transaction will not see them. That's the way it's supposed to work.

So your query is returning 100% accurate results, with respect to the data that existed at a specific point in the sequence of INSERTs. But you are rapidly changing the data, so your SELECT is bound to return an outdated report.

Using another transaction isolation level probably won't help. InnoDB refreshes a transaction's snapshot only at the beginning of a statement. In other words, if the SELECT takes 0.2 seconds, and during that time another 100 INSERTs happen, those 100 INSERTs won't be included in the result of the SELECT.

http://dev.mysql.com/doc/refman/5.6/en/innodb-consistent-read.html says:

With READ COMMITTED isolation level, each consistent read within a transaction sets and reads its own fresh snapshot.

The only way you can ensure your SELECT is totally in sync with the latest committed data is to prevent any new data from committing while the SELECT is executing. You can lock the tables temporarily with SELECT ... LOCK IN SHARE MODE. See http://dev.mysql.com/doc/refman/5.6/en/innodb-locking-reads.html

But even if you do this, as soon as you finish that transaction, the lock is released and new INSERTs start committing, making the result of your SELECT out of date almost instantly.