MySQL – I have multiple sources INSERTing into a MySQL InnoDB table. My periodic aggregation script never makes accurate aggregates. Why?

Tags: aggregate, MySQL

I apologize in advance if this is a repeat. I'm not really sure how to properly ask for what I'm running into.

I have a large InnoDB table set up. I have 3 sources that all INSERT concurrently at a moderately high volume (300-500 INSERT/s).

I have a Perl script running as a cron job every 5 minutes that aggregates the rows inserted since the last time it ran. It keeps track of the last record it processed by storing the auto_increment value of the last row in a metadata table.

The aggregates are always off, but not by much. This has been so frustrating because it's just plain simple math (SELECT account_id,sum(sold) GROUP BY account_id). I have a suspicion that it has something to do with the transaction isolation level (REPEATABLE READ).

I recently found FlexViews which looks very cool and might address my problem. But I was wondering if anyone could:

Tell me what is going on here. Why is this happening?
What can be done to produce 100% accurate aggregates?

I'm like 95% sure the auto_increment thing along with transaction commit ordering is hosing me up, but I don't understand enough about DBs to really ask the right question.

Oh, one thing to note: I've already checked over the field types. This issue isn't the result of rounding.

Best Answer

Using REPEATABLE-READ means that your SELECT transaction gets to view a "snapshot" of the state of data as of the moment the transaction started. Even if other sessions subsequently insert more data, the SELECT transaction will not see them. That's the way it's supposed to work.

So your query is returning 100% accurate results, with respect to the data that existed at a specific point in the sequence of INSERTs. But you are rapidly changing the data, so your SELECT is bound to return an outdated report.

Using another transaction isolation level probably won't help. Even under READ COMMITTED, InnoDB refreshes the snapshot only at the beginning of each statement. In other words, if the SELECT takes 0.2 seconds, and during that time another 100 INSERTs happen, those 100 INSERTs won't be included in the result of the SELECT.

http://dev.mysql.com/doc/refman/5.6/en/innodb-consistent-read.html says:

With READ COMMITTED isolation level, each consistent read within a transaction sets and reads its own fresh snapshot.
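The difference between the two isolation levels can be sketched with a toy model. This is not how InnoDB is implemented; `ToyTable` and `ToyTransaction` are hypothetical names, and a simple counter stands in for commit order. The only point is to show when each level takes a new snapshot.

```python
class ToyTable:
    """Append-only table; each row remembers the logical time it was committed."""

    def __init__(self):
        self.clock = 0
        self.rows = []  # list of (commit_time, sold)

    def insert_committed(self, sold):
        self.clock += 1
        self.rows.append((self.clock, sold))

    def sum_as_of(self, snapshot_time):
        # Only rows committed at or before the snapshot are visible.
        return sum(sold for t, sold in self.rows if t <= snapshot_time)


class ToyTransaction:
    """Models snapshot timing only (assumption: no locks, no rollbacks)."""

    def __init__(self, table, isolation):
        self.table = table
        self.isolation = isolation
        self.snapshot = None

    def select_sum(self):
        # READ COMMITTED: fresh snapshot for every statement.
        # REPEATABLE READ: snapshot fixed by the first read in the transaction.
        if self.isolation == "READ COMMITTED" or self.snapshot is None:
            self.snapshot = self.table.clock
        return self.table.sum_as_of(self.snapshot)


table = ToyTable()
table.insert_committed(10)
table.insert_committed(20)

repeatable = ToyTransaction(table, "REPEATABLE READ")
committed = ToyTransaction(table, "READ COMMITTED")
first_rr, first_rc = repeatable.select_sum(), committed.select_sum()

table.insert_committed(5)  # another session commits while both transactions are open
second_rr, second_rc = repeatable.select_sum(), committed.select_sum()

print(first_rr, second_rr)  # REPEATABLE READ keeps seeing the old total
print(first_rc, second_rc)  # READ COMMITTED picks up the new row on the next statement
```

Either way, a row committed after a statement has started is invisible to that statement, which is the point made above.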

The only way you can ensure your SELECT is totally in sync with the latest committed data is to prevent any new data from committing while the SELECT is executing. You can do that temporarily with a locking read, SELECT ... LOCK IN SHARE MODE, which takes shared locks on the rows it reads (and, under REPEATABLE READ, on the gaps between them), so conflicting INSERTs have to wait. See http://dev.mysql.com/doc/refman/5.6/en/innodb-locking-reads.html

But even if you do this, as soon as you finish that transaction, the lock is released and new INSERTs start committing, making the result of your SELECT out of date almost instantly.
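The asker's own suspicion, that the stored auto_increment high-water mark interacts badly with commit ordering, can be illustrated with another toy model. It assumes that an id is handed out when a row is inserted rather than when its transaction commits, and that the aggregation script only looks at committed rows with an id greater than the last one it stored. `ToyInnoDB` and `aggregate_since` are hypothetical names for this sketch, not part of MySQL or of the asker's script.

```python
class ToyInnoDB:
    """Ids are assigned at INSERT time; rows become visible only at COMMIT."""

    def __init__(self):
        self.next_id = 1
        self.pending = {}    # id -> sold, inserted but not yet committed
        self.committed = {}  # id -> sold, visible to readers

    def insert(self, sold):
        row_id = self.next_id
        self.next_id += 1
        self.pending[row_id] = sold
        return row_id

    def commit(self, row_id):
        self.committed[row_id] = self.pending.pop(row_id)

    def committed_after(self, last_id):
        return {i: s for i, s in self.committed.items() if i > last_id}


def aggregate_since(db, last_id):
    """One cron run: sum visible rows past the stored id, then advance the id."""
    rows = db.committed_after(last_id)
    new_last_id = max(rows, default=last_id)
    return sum(rows.values()), new_last_id


db = ToyInnoDB()
slow = db.insert(10)   # gets id 1, transaction still open
fast = db.insert(20)   # gets id 2
db.commit(fast)        # id 2 commits before id 1

run1_total, last_id = aggregate_since(db, 0)        # sees only id 2
db.commit(slow)                                     # id 1 becomes visible now
run2_total, last_id = aggregate_since(db, last_id)  # looks only at id > 2

aggregated = run1_total + run2_total
actual = sum(db.committed.values())
print(aggregated, actual)
```

In this model the row with id 1 is never counted, because the stored marker had already moved past it by the time it committed. That would be consistent with totals that are off, but not by much.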

Related Solutions

MySQL – Is there a slowdown when inserting into an InnoDB table that has no index set

I haven't tested this (as my tables usually have at least a primary key) but I expect the difference to depend on the choice of the primary key.

According to the MySQL documentation on the InnoDB engine, every InnoDB table has a clustered index. This is the table's PRIMARY KEY or, in the absence of one, the first UNIQUE index whose columns are all NOT NULL. If there is no such unique key either, a hidden column is created (with auto-generated values) and used internally for unique identification and for clustering.

The "key" here is that this hidden column is a 6-byte integer. So you have 6 bytes of overhead per row when you define neither a primary key nor any unique index on an InnoDB table.

In contrast, if you have a narrower column (like a 1-, 2-, 3- or 4-byte auto-incrementing integer column) defined as the primary key or a unique index, you should expect an improvement in INSERT efficiency.

If you define an 8-byte integer, I'd expect a decrease in efficiency.

If it is a VARCHAR(20) or a wider combination of columns, the decrease would probably be worse, depending on the order in which you provide the PK values. Auto-incrementing values are by definition increasing (so good for a clustered index). If you provide non-increasing values, that will affect INSERT efficiency negatively.
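Why insertion order matters for a clustered index can be sketched with a toy model of fixed-capacity leaf pages. This is only an illustration under simplifying assumptions: the capacity of 16 keys, the split-in-half rule and the function name `simulate_inserts` are all made up for the sketch, and InnoDB's real page-split heuristics are more elaborate.

```python
import bisect
import random


def simulate_inserts(keys, capacity=16):
    """Insert keys into sorted leaf pages; return (page count, split count)."""
    pages = [[]]  # leaf pages in key order, each a sorted list of keys
    splits = 0
    for key in keys:
        # Route the key to the last page whose first key is <= key.
        first_keys = [page[0] for page in pages[1:]]
        idx = bisect.bisect_right(first_keys, key)
        page = pages[idx]
        if len(page) < capacity:
            bisect.insort(page, key)
            continue
        if idx == len(pages) - 1 and key > page[-1]:
            # Appending past the end of the table: start a new page
            # and leave the full one untouched.
            pages.append([key])
            continue
        # Insert into the middle of a full page: split it in half first.
        splits += 1
        mid = capacity // 2
        left, right = page[:mid], page[mid:]
        pages[idx:idx + 1] = [left, right]
        bisect.insort(left if key < right[0] else right, key)
    return len(pages), splits


n = 1600
increasing = list(range(n))          # like an auto-increment primary key
shuffled = increasing[:]
random.Random(0).shuffle(shuffled)   # like arbitrary non-increasing PK values

seq_pages, seq_splits = simulate_inserts(increasing)
rand_pages, rand_splits = simulate_inserts(shuffled)
print("increasing keys:", seq_pages, "pages,", seq_splits, "splits")
print("shuffled keys:  ", rand_pages, "pages,", rand_splits, "splits")
```

In this model, increasing keys always land at the end and fill every page completely, while shuffled keys force splits and leave pages partly empty, so the same rows take more pages and more work per INSERT.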

MySQL – Inserting into MySQL table with ARCHIVE engine “duplicate key” error

This sounds very unusual for a table using the ARCHIVE Storage Engine. Why? A duplicate key error is not characteristic of the ARCHIVE Storage Engine, since

The engine does not support the creation of indexes
The engine supports INSERTs and SELECTs

Surprisingly, there can be a key internally present. How?

According to the MySQL Documentation

The ARCHIVE engine supports the AUTO_INCREMENT column attribute. The AUTO_INCREMENT column can have either a unique or nonunique index. Attempting to create an index on any other column results in an error. The ARCHIVE engine also supports the AUTO_INCREMENT table option in CREATE TABLE and ALTER TABLE statements to specify the initial sequence value for a new table or reset the sequence value for an existing table, respectively.

Given this information, look back at the table and the query

insert into test_table (id,arch) values (123,'FILE_CONTENT')

If the id column has the AUTO_INCREMENT attribute, you should not supply an explicit value for id. Supplying a value that already exists would produce the usual 1062 error (Duplicate Key) in other storage engines as well.

SUGGESTIONS

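Change the INSERT to a form that lets AUTO_INCREMENT assign the id, either by passing 0 (which generates the next value unless the NO_AUTO_VALUE_ON_ZERO SQL mode is enabled)

insert into test_table (id,arch) values (0,'FILE_CONTENT')

or by leaving the id column out altogether

insert into test_table (arch) values ('FILE_CONTENT')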

Give it a Try !!!

UPDATE 2013-08-06 16:57 EST

If you are planning to do queries from the archive table, you need to get away from the ARCHIVE Storage Engine. Why? Again, according to the MySQL Documentation

Retrieval: On retrieval, rows are uncompressed on demand; there is no row cache. A SELECT operation performs a complete table scan: When a SELECT occurs, it finds out how many rows are currently available and reads that number of rows. SELECT is performed as a consistent read. Note that lots of SELECT statements during insertion can deteriorate the compression, unless only bulk or delayed inserts are used

Note that every SELECT against an ARCHIVE table is a full table scan. If you look up id 123 in a table with 1,000,000 rows, you have to read all 1,000,000 rows every time.

SUGGESTION

Convert the table to MyISAM. Then you can have a proper index on id plus the ability to create other indexes on other columns as needed.

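-- Create an empty MyISAM table with the same columns as the ARCHIVE table
CREATE TABLE test_table_myisam ENGINE=MyISAM AS SELECT * FROM test_table WHERE 1=2;
-- Index id so lookups no longer need a full table scan
ALTER TABLE test_table_myisam ADD PRIMARY KEY (id);
-- Copy the rows; IGNORE skips any row whose id duplicates one already copied
INSERT IGNORE INTO test_table_myisam SELECT * FROM test_table;
-- Replace the ARCHIVE table with the MyISAM copy
DROP TABLE test_table;
ALTER TABLE test_table_myisam RENAME test_table;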