MySQL InnoDB Import Performance – Tips and Optimization

Tags: import, innodb, mysql, performance

I'm struggling with bulk importing a rather large InnoDB table of roughly 10 million rows (about 7 GB), which for me is the biggest table I've worked with so far.

I did some research on how to improve InnoDB's import speed, and for the moment my setup looks like this:

/etc/mysql/my.cnf
[...]
innodb_buffer_pool_size = 7446915072 # ~90% of memory
innodb_read_io_threads = 64
innodb_write_io_threads = 64
innodb_io_capacity = 5000
innodb_thread_concurrency=0
innodb_doublewrite = 0
innodb_log_file_size = 1G
log-bin = ""
innodb_autoinc_lock_mode = 2
innodb_flush_method = O_DIRECT
innodb_flush_log_at_trx_commit=2
innodb_buffer_pool_instances=8
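
These values can be double-checked from a client session after a restart; a quick sketch (the variable list is just an example and can be extended):

SHOW GLOBAL VARIABLES
WHERE Variable_name IN ('innodb_buffer_pool_size', 'innodb_flush_method',
                        'innodb_doublewrite', 'innodb_flush_log_at_trx_commit');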


The import is done via a bash script; here is the MySQL code:
SET GLOBAL sync_binlog = 1;
SET sql_log_bin = 0;
SET FOREIGN_KEY_CHECKS = 0;
SET UNIQUE_CHECKS = 0;
SET AUTOCOMMIT = 0;
SET SESSION tx_isolation='READ-UNCOMMITTED';
LOAD DATA LOCAL INFILE '$filepath' INTO TABLE monster;
COMMIT;
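
One thing I'm not sure about: as written, LOAD DATA uses its default tab-separated format, so for a comma-separated file I'd probably have to spell out the delimiters. Something like this (the delimiters are my assumption about the dump format):

LOAD DATA LOCAL INFILE '$filepath' INTO TABLE monster
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n';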

Data is provided in a CSV file.
Currently I test my settings with smaller 'test dumps' of 2 million, 3 million, … rows each and use time import_script.sh to compare performance.

The drawback is that I only get an overall running time, so I have to wait for the full import to finish to see a result.

My results so far:

  • 10 000 rows: <1 second
  • 100 000 rows: 10 seconds
  • 300 000 rows: 40 seconds
  • 2 million rows: 18 minutes
  • 3 million rows: 26 minutes
  • 4 million rows: (cancelled after 2 hours)

It seems there is no 'cookbook' solution, and everyone has to figure out the optimal mix of settings on their own.
Besides suggestions about what to change in my setup, I would also really appreciate more information on how to better benchmark the import process and gain more insight into what is happening and where the bottleneck might be (one idea is sketched below).
I tried to read up on the documentation for the settings I'm changing, but I'm not aware of their side effects, or whether I might even decrease performance with a badly chosen value.
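
One idea for getting more visibility while the load is running would be to poll the InnoDB counters from a second session and compare the deltas over time; a minimal sketch using standard status variables:

SHOW GLOBAL STATUS LIKE 'Innodb_rows_inserted';           -- rows loaded so far
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty'; -- pages waiting to be flushed
SHOW GLOBAL STATUS LIKE 'Innodb_data_writes';             -- physical write operations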

For the moment I would like to try a suggestion from chat: use MyISAM during the import and change the table engine afterwards.
I'd like to try this, but for the moment my DROP TABLE query also takes hours to finish (which seems like another indicator that my setup is less than optimal).
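
For reference, the engine switch afterwards would be a single statement, though on a table this size it rebuilds the whole table and all of its indexes, so it won't be free either:

ALTER TABLE monster ENGINE=InnoDB;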

Additional information:
The machine I'm currently using has 8 GB of RAM and a solid-state hybrid drive spinning at 5400 RPM.
While we also aim to remove obsolete data from the table in question, I still need a somewhat fast import to
a) test the automatic data cleanup feature while developing, and
b) in case our server crashes, use our 2nd server as a replacement (which needs up-to-date data; the last import took more than 24 hours).

mysql> SHOW CREATE TABLE monster\G
*************************** 1. row ***************************
       Table: monster
Create Table: CREATE TABLE `monster` (
  `monster_id` int(11) NOT NULL AUTO_INCREMENT,
  `ext_monster_id` int(11) NOT NULL DEFAULT '0',
  `some_id` int(11) NOT NULL DEFAULT '0',
  `email` varchar(250) NOT NULL,
  `name` varchar(100) NOT NULL,
  `address` varchar(100) NOT NULL,
  `postcode` varchar(20) NOT NULL,
  `city` varchar(100) NOT NULL,
  `country` int(11) NOT NULL DEFAULT '0',
  `address_hash` varchar(250) NOT NULL,
  `lon` float(10,6) NOT NULL,
  `lat` float(10,6) NOT NULL,
  `ip_address` varchar(40) NOT NULL,
  `cookie` int(11) NOT NULL DEFAULT '0',
  `party_id` int(11) NOT NULL,
  `status` int(11) NOT NULL DEFAULT '2',
  `creation_date` datetime NOT NULL,
  `someflag` tinyint(1) NOT NULL DEFAULT '0',
  `someflag2` tinyint(4) NOT NULL,
  `upload_id` int(11) NOT NULL DEFAULT '0',
  `news1` tinyint(4) NOT NULL DEFAULT '0',
  `news2` tinyint(4) NOT NULL,
  `someother_id` int(11) NOT NULL DEFAULT '0',
  `note` varchar(2500) NOT NULL,
  `referer` text NOT NULL,
  `subscription` int(11) DEFAULT '0',
  `hash` varchar(32) DEFAULT NULL,
  `thumbs1` int(11) NOT NULL DEFAULT '0',
  `thumbs2` int(11) NOT NULL DEFAULT '0',
  `thumbs3` int(11) NOT NULL DEFAULT '0',
  `neighbours` tinyint(4) NOT NULL DEFAULT '0',
  `relevance` int(11) NOT NULL,
  PRIMARY KEY (`monster_id`),
  KEY `party_id` (`party_id`),
  KEY `creation_date` (`creation_date`),
  KEY `email` (`email`(4)),
  KEY `hash` (`hash`(8)),
  KEY `address_hash` (`address_hash`(8)),
  KEY `thumbs3` (`thumbs3`),
  KEY `ext_monster_id` (`ext_monster_id`),
  KEY `status` (`status`),
  KEY `note` (`note`(4)),
  KEY `postcode` (`postcode`),
  KEY `some_id` (`some_id`),
  KEY `cookie` (`cookie`),
  KEY `party_id_2` (`party_id`,`status`)
) ENGINE=InnoDB AUTO_INCREMENT=13763891 DEFAULT CHARSET=utf8

Best Answer

First, you need to know what you are doing to InnoDB when you plow millions of rows into an InnoDB table. Let's take a look at the InnoDB Architecture.

[Diagram: InnoDB Architecture]

In the upper left corner, there is an illustration of the InnoDB Buffer Pool. Notice there is a section of it dedicated to the insert buffer. What does that do? It is used to migrate changes to secondary indexes from the Buffer Pool to the Insert Buffer inside the system tablespace (a.k.a. ibdata1). By default, innodb_change_buffer_max_size is set to 25. This means that up to 25% of the Buffer Pool can be used for processing secondary indexes.

In your case, you have 6.935 GB for the InnoDB Buffer Pool. A maximum of 1.734 GB will be used for processing your secondary indexes.

Now, look at your table. You have 13 secondary indexes. Each row you process must generate a secondary index entry, couple it with the primary key of the row, and send them as a pair from the Insert Buffer in the Buffer Pool into the Insert Buffer in ibdata1. That happens 13 times with each row. Multiply this by 10 million and you can almost feel a bottleneck coming.

Don't forget that importing 10 million rows in a single transaction will pile up everything into one rollback segment and fill up the UNDO space in ibdata1.
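
Both effects can be watched while an import runs; a sketch of what to look at (insert buffer activity appears under "INSERT BUFFER AND ADAPTIVE HASH INDEX", undo growth as "History list length" under TRANSACTIONS):

-- portion of the Buffer Pool available for change buffering
SHOW GLOBAL VARIABLES LIKE 'innodb_change_buffer_max_size';
-- insert buffer merges and "History list length" (undo growth) show up here
SHOW ENGINE INNODB STATUS\G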

SUGGESTIONS

SUGGESTION #1

My first suggestion for importing this rather large table would be

  • Drop all the non-unique indexes
  • Import the data
  • Create all the non-unique indexes

SUGGESTION #2

Get rid of duplicate indexes. In your case, you have

KEY `party_id` (`party_id`),
KEY `party_id_2` (`party_id`,`status`)

Both indexes start with party_id, so you can reduce secondary index processing by at least 7.6% by getting rid of one index out of the 13. You will eventually need to run

ALTER TABLE monster DROP INDEX party_id;

SUGGESTION #3

Get rid of indexes you do not use. Look over your application code and see if your queries use all the indexes. You may want to look into pt-index-usage to let it suggest what indexes are not being used.

SUGGESTION #4

You should increase the innodb_log_buffer_size to 64M since the default is 8M. A bigger log buffer may increase InnoDB write I/O performance.
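
innodb_log_buffer_size is not a dynamic variable in this MySQL version, so the change goes into the [mysqld] section of the my.cnf shown above (innodb_log_buffer_size = 64M) followed by a server restart; afterwards the value can be confirmed with:

SHOW GLOBAL VARIABLES LIKE 'innodb_log_buffer_size';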

EPILOGUE

To put the first two suggestions in place, do the following:

  • Drop the 13 non-unique indexes
  • Import the data
  • Create all the non-unique indexes except the party_id index

Perhaps the following may help

CREATE TABLE monster_new LIKE monster;
ALTER TABLE monster_new
  DROP INDEX `party_id`,
  DROP INDEX `creation_date`,
  DROP INDEX `email`,
  DROP INDEX `hash`,
  DROP INDEX `address_hash`,
  DROP INDEX `thumbs3`,
  DROP INDEX `ext_monster_id`,
  DROP INDEX `status`,
  DROP INDEX `note`,
  DROP INDEX `postcode`,
  DROP INDEX `some_id`,
  DROP INDEX `cookie`,
  DROP INDEX `party_id_2`;
ALTER TABLE monster RENAME monster_old;
ALTER TABLE monster_new RENAME monster;

Import the data into monster. Then, run this

ALTER TABLE monster
  ADD INDEX `creation_date` (`creation_date`),
  ADD INDEX `email` (`email`(4)),
  ADD INDEX `hash` (`hash`(8)),
  ADD INDEX `address_hash` (`address_hash`(8)),
  ADD INDEX `thumbs3` (`thumbs3`),
  ADD INDEX `ext_monster_id` (`ext_monster_id`),
  ADD INDEX `status` (`status`),
  ADD INDEX `note` (`note`(4)),
  ADD INDEX `postcode` (`postcode`),
  ADD INDEX `some_id` (`some_id`),
  ADD INDEX `cookie` (`cookie`),
  ADD INDEX `party_id_2` (`party_id`,`status`);

GIVE IT A TRY !!!

ALTERNATIVE

You could create a table called monster_csv as a MyISAM table with no indexes and do this:

CREATE TABLE monster_csv ENGINE=MyISAM AS SELECT * FROM monster WHERE 1=2;
ALTER TABLE monster RENAME monster_old;
CREATE TABLE monster LIKE monster_old;
ALTER TABLE monster DROP INDEX `party_id`;

Import your data into monster_csv. Then, use mysqldump to create another import

mysqldump -t -uroot -p mydb monster_csv | sed 's/monster_csv/monster/g' > data.sql

The mysqldump file data.sql will have extended INSERT commands importing 10,000-20,000 rows at a time.

Now, just load the mysqldump

mysql -uroot -p mydb < data.sql

Finally, get rid of the MyISAM table

DROP TABLE monster_csv;