mysqldump vs LOAD DATA INFILE

Tags: MySQL, mysqldump, partitioning

Why would you ever use mysqldump to repopulate a table over LOAD DATA INFILE?

Here's the back story to my question. We had a table that was incorrectly partitioned: it was supposed to be partitioned into 10M-row chunks by PK, but after the 90M => 100M partition it started getting partitioned into 100M-row chunks (100M => 200M, etc.).

The most recent partitions held 135M rows between them, so we decided it was time to bite the bullet, take some downtime, and fix this mess.

The game plan was basically:

1) mysqldump >= 100M

2) drop partitions >= 100M

3) Create the desired partitioned ranges

4) Re-add the data via the mysqldump files

These dumps were done in 10M chunks so we could reload into the downed master and slaves simultaneously and do some sanity checking in between, so we didn't get too far in before realizing we were screwed. This was an append-only table that didn't mutate, so we were able to have the historical dumps done beforehand and scp'd to local copies before the real downtime had to start. Because of this we added --skip-disable-keys, since our version of MySQL doesn't allow disabling keys per partition and we didn't want continuous index rebuilds after each chunk; that could have something to do with the performance difference I'm about to lay out.
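For context, each chunk dump probably looked roughly like the sketch below (the database name, table name, and PK column are hypothetical; `--where`, `--no-create-info`, and `--skip-disable-keys` are real mysqldump options):

```
mysqldump --no-create-info --skip-disable-keys \
  --where="id >= 100000000 AND id < 110000000" \
  mydb big_table > chunk_100M.sql
```

Repeating this with the PK range shifted by 10M each time yields one reloadable file per target partition.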

Some benchmark tests beforehand left us with an estimated 90 minutes of downtime. We were wrong: the mysqldump reloads were taking about 3-4 times longer than expected.

We had some time to just twiddle our thumbs. Part of the contingency plan was to not drop one of the slaves' partitions until the master and everyone else was rebuilt, "just in case". While the rebuild was going we decided to run a test: SELECT * INTO OUTFILE on the untouched slave for some segments we'd yet to get to, then reload those with LOAD DATA INFILE.
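The export/reload pair might have looked something like this sketch (table name, PK column, and file paths are hypothetical stand-ins):

```
-- On the untouched slave: export one PK range as tab-delimited text
SELECT * INTO OUTFILE '/tmp/chunk_110M.tsv'
FROM big_table
WHERE id >= 110000000 AND id < 120000000;

-- On each server being rebuilt: bulk-load that file
LOAD DATA INFILE '/tmp/chunk_110M.tsv' INTO TABLE big_table;
```

INTO OUTFILE writes on the server's filesystem, so the file then has to be copied to the machines doing the reload.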

We made our dump, gzip'd it, and copied it over to the machines in the process of rebuilding. To save the overhead of having to uncompress it and then read it again, we gzip -dc'd it into a named pipe and loaded from that.
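A minimal sketch of that named-pipe trick, with stand-in paths and data (in production the reader would be the mysql client running LOAD DATA INFILE against the FIFO; here the stream is just captured to show it arrives intact):

```shell
set -e
dir=$(mktemp -d)

# Stand-in for the real gzipped dump chunk
printf '1\thello\n2\tworld\n' | gzip > "$dir/chunk.tsv.gz"

# Create a FIFO and decompress into it in the background;
# the consumer reads it like a plain file, with no uncompressed
# temp copy ever hitting disk.
mkfifo "$dir/chunk.fifo"
gzip -dc "$dir/chunk.tsv.gz" > "$dir/chunk.fifo" &

# In production the reader was the server, e.g.:
#   mysql -e "LOAD DATA INFILE '$dir/chunk.fifo' INTO TABLE big_table"
# Here we just capture the stream:
result=$(cat "$dir/chunk.fifo")
wait
rm -r "$dir"
```

The writer blocks until a reader opens the pipe, so the decompression and the load proceed in lockstep.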

The LOAD DATA approach was finishing in about 4 minutes per chunk, rather than the 12-15 minutes the mysqldump reloads were taking.

I know the manual says this can be faster for larger loads, but it left me struggling for a reason why we should ever use mysqldump if our schema is already in place.

P.S. I'm aware of REORGANIZE PARTITION but have found this dump/reload into an altered schema to be more performant in the past.

Best Answer

If you have a schema already in place, you should use mysqldump options to dump only the data, i.e. just the INSERT statements:

mysqldump -h... -u... -p... --no-create-info --databases ... > MySQLData.sql

You should also make sure to raise the bulk_insert_buffer_size (default is 8M) on all DB Servers to accommodate large extended inserts. This will also help LOAD DATA INFILE if loading non-empty tables.

You will also have to adjust your max_allowed_packet (default is 1M).

Try these settings for starters:

[mysqld]
max_allowed_packet=256M
bulk_insert_buffer_size=256M

UPDATE 2011-10-11 06:53 EDT

Regardless of how beefy the hardware is, how much RAM you have allocated, how well tuned the OS is, and how current the version of MySQL is, MySQL will only perform as well as it is configured.

Example: If you have a swimming pool that can hold 10,000 gallons of water, you have a truck holding 10,000 gallons of water, and the hose on the truck is only as big as a straw, you can only push so much water through the straw to fill the pool. Getting a bigger truck or a faster water pump simply will not improve things. You must exchange the hose for a much bigger one to accommodate more water and, thus, more throughput.

In like fashion, MySQL 5.5 out-of-the-box does not come fully tuned.

Example #1: MySQL 5.5 comes with semisynchronous replication. By default, it is disabled. You must do the two-step process of starting mysql, running INSTALL PLUGIN for the master and slave modules, shutting down mysql, adding timing and activation options to my.cnf for the semisync features, then starting mysql up a second time. Only then will semisync replication work.
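As a sketch, the two-step process looks roughly like this (plugin file names assume a Linux build; the timeout value is an illustrative starting point, not a recommendation):

```
-- Step 1: with mysql running, install the modules
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';  -- on the master
INSTALL PLUGIN rpl_semi_sync_slave  SONAME 'semisync_slave.so';   -- on each slave

-- Step 2: shut down, add the activation/timing options to my.cnf,
-- then start mysql again:
--   [mysqld]
--   rpl_semi_sync_master_enabled = 1
--   rpl_semi_sync_master_timeout = 1000   (ms to wait for a slave ack)
--   rpl_semi_sync_slave_enabled  = 1
```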

Example #2: The bulk insert buffer is 8M by default. It will not grow because of the presence of any particular hardware or software; it stays 8M until you increase it. You can increase it either by adding it to my.cnf and restarting mysql, or by running SET bulk_insert_buffer_size = 268435456; to set it to 256M within a session and then loading the mysqldump within that same session.

Example #3: MySQL 5.5 is fully capable of engaging multiple CPUs. By default, the features for engaging multiple CPUs are disabled. They require tuning because, although MySQL 5.5 is multicore-ready, it is only as multithreaded as you configure it.
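For instance, the InnoDB threading knobs in 5.5 might be opened up along these lines (the values shown are hypothetical starting points to tune for your hardware, not recommendations):

```
[mysqld]
innodb_thread_concurrency = 0    # 0 = unlimited; let InnoDB self-manage
innodb_read_io_threads    = 8    # default is 4
innodb_write_io_threads   = 8    # default is 4
```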

Conclusion: You must configure MySQL to recognize that it has beefy hardware, more RAM, and a cooperative OS at its disposal.