MySQL – Updating/Inserting 2000 Entries Takes Over 12 Hours

amazon-rds, MySQL, mysql-5.6

I have three tables of around 10 million rows each in MySQL on Amazon RDS and I am finding the insert/update performance to be VERY slow. Each table contains unrelated data.

Each table is regularly updated with about 2000 rows each time. The rows either update existing values or insert new ones. I am finding that updating or inserting 2000 rows takes over 12 hours to perform.

The database has 100 GB of General Purpose SSD storage, which can sustain a baseline of 300 IOPS (3 IOPS per GB). CPU utilisation is below 20% and half of the RAM is free.

I am using an index, and a primary key constructed from two columns of the table: CONSTRAINT PK_1 PRIMARY KEY (DATE, NAME)

I am using the following query:

    INSERT INTO Table (DATE, NAME, COLUMN_1, ..., COLUMN_10)
    VALUES 
    ('2015-05-26','David', VALUE_1, ... , VALUE_10),
    ...
    ('2015-05-26','Tom', VALUE_1, ... , VALUE_10)
    ON DUPLICATE KEY UPDATE COLUMN_1=VALUES(COLUMN_1), ... , COLUMN_10=VALUES(COLUMN_10);

The statement above inserts/updates 7 rows of 10 columns each; this is repeated until all 2000 entries (users) have been written.

The reason I chose this statement over a REPLACE INTO statement is that, while the table is 40 columns wide, I only receive 10 of those columns at a time. The values in a row can also keep changing for up to a week.
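To illustrate why (using the same hypothetical column names as above, with the other columns labelled COLUMN_11..COLUMN_40 for the sake of the example): REPLACE INTO works by deleting the existing row and inserting a new one, so the 30 columns not supplied in the batch would be reset to their defaults, whereas ON DUPLICATE KEY UPDATE overwrites only the columns named:

    -- REPLACE deletes the matching row first, so COLUMN_11..COLUMN_40
    -- (not present in this statement) would be lost:
    REPLACE INTO Table (DATE, NAME, COLUMN_1, ..., COLUMN_10)
    VALUES ('2015-05-26', 'David', VALUE_1, ..., VALUE_10);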

By default, the data was being inserted into the three tables sequentially. I have tried inserting the data in parallel. However, there was no significant improvement in the performance.

I don't understand why it takes over 12 hours to insert 2000 rows. What are my options for improving performance?
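For scale, a quick back-of-the-envelope calculation using the figures above (plain arithmetic, nothing MySQL-specific):

    # Rough throughput check using the numbers from the question.
    rows = 2000
    seconds = 12 * 3600                  # "over 12 hours"
    iops = 300                           # baseline for 100 GB General SSD

    per_row = seconds / rows             # seconds spent per row
    ios_per_row = seconds * iops / rows  # I/O budget per row at 300 IOPS

    print(per_row, ios_per_row)          # 21.6 seconds and 6480 I/Os per row

Even if every upserted row cost dozens of random I/Os, 300 IOPS should get through 2000 rows in minutes, not hours, so raw storage throughput alone cannot explain a 12-hour runtime.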

Best Answer

I'm going to give this my best guess, but I will preface it by saying that we don't have intimate knowledge of your schema, and while I am knowledgeable about MySQL, I would not call myself an expert.

Index-organized tables

One problem you might be having is due to the way in which MySQL stores data, along with your UPSERT behavior. Using the InnoDB engine in MySQL, every table is an index-organized table. This means that the data payload is stored directly within the pages of the index. In effect, the primary key index is also the table structure.

Now, in your case, you are inserting rows with a total of 42 columns (the two key columns plus 40 others). Without knowing the table structure, I am guessing that some of these 40 columns are of variable length, so each time you update a row there is likely to be quite a lot of data movement: InnoDB has to rewrite the index and data pages to accommodate the new data.
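One way to check whether this page rewriting is really where the time goes is to sample InnoDB's status counters immediately before and after one of your 2000-row batches (these are standard InnoDB status variables; this is only a diagnostic sketch):

    -- Run before and after a batch; large jumps point at write
    -- amplification or lock waits rather than raw statement cost.
    SHOW GLOBAL STATUS LIKE 'Innodb_data_writes';             -- total write operations
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty'; -- pages awaiting flush
    SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_time';           -- ms spent waiting on row locks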

Change your schema?

So, based on this likely limitation, I think one thing you should consider is a schema change. What you have now is essentially one big flat table. But you've stated that you receive only 10 of those 40 columns at a time, so why not split it into at least 5 tables? That is (in loose pseudocode, and make sure to add your FK relationships):

CREATE TABLE record (RecID, DATE, NAME);
CREATE TABLE column_set_1 (RecID,COL1,...,COL10);
CREATE TABLE column_set_2 (RecID,COL11,...,COL20);
CREATE TABLE column_set_3 (RecID,COL21,...,COL30);
CREATE TABLE column_set_4 (RecID,COL31,...,COL40);

This way, you are only performing inserts, and only into the relevant tables without touching all this other data.
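A slightly more concrete version of that pseudocode (the column types and name length are placeholders I've assumed, and only the first column_set table is shown):

    CREATE TABLE record (
        RecID  INT UNSIGNED NOT NULL AUTO_INCREMENT,
        `DATE` DATE         NOT NULL,
        NAME   VARCHAR(64)  NOT NULL,          -- assumed length
        PRIMARY KEY (RecID),
        UNIQUE KEY UK_record (`DATE`, NAME)
    ) ENGINE=InnoDB;

    CREATE TABLE column_set_1 (
        RecID INT UNSIGNED NOT NULL,
        COL1  INT NULL,                        -- placeholder types
        -- COL2 ... COL9 along the same lines
        COL10 INT NULL,
        PRIMARY KEY (RecID),
        CONSTRAINT FK_column_set_1 FOREIGN KEY (RecID) REFERENCES record (RecID)
    ) ENGINE=InnoDB;

Each incoming batch then touches record plus whichever single column_set_* table the 10 supplied columns belong to.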

Table Partitioning

Also, because you have a DATE field, it may be helpful to partition your table. I can't say for sure if this will have a strong positive impact, but I suspect it may. If you go for my suggested schema change, for example, I might choose to hash partition the 4 column_set_* tables based on RecID and then the INSERT...ON DUPLICATE KEY UPDATE can make use of partition lock pruning.
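A minimal sketch of that partitioning on one of the column_set_* tables (8 partitions is an arbitrary choice). One caveat I'm fairly sure of: MySQL does not allow foreign keys on partitioned tables, so if you partition these tables you would have to enforce the relationship back to the parent table in the application instead:

    ALTER TABLE column_set_1
        PARTITION BY HASH (RecID)
        PARTITIONS 8;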

Best of luck, and maybe someone with better MySQL expertise can correct any mistakes I've made.