MySQL – Efficiently Cloning Tens of Millions of Database Records

csvMySQL

A laboratory I work in is attempting to clone the open, biochemical database managed by PubChem found here. (Please don't ask why, I am not in charge of this project, I am simply tasked with the cloning of the database) But their database contains tens of millions of records and the script that I currently have running is using the "READ DATA INFILE" instruction of MySQL to write the 720 thousand csv files we have of their database into our MySQL database… one by one. At the current rate, the sheer amount of data being saved is going to make this script take roughly 240 days to finish, and we don't have that much time. The questions I have are

  1. Is there a generally accepted way to quickly enter vast amounts of csv files into
    mysql (over 650 GB of data)

  2. If not, is there a database system better suited for this sort of task, perhaps one designed specifically for big data? (I'm no database expert, so forgive me if this question makes no sense)

Best Answer

Simple approaches:

  • If your tables are indexed, drop the indexes and put them back after you load the data. It's generally going to be faster for MySQL to build the index on a table in one pass than to continually index the incoming data as you add it.
  • Run more than one copy of your script, each with a subset of the data source files, doing inserts in parallel
  • Enable compression if you're using InnoDB: http://dev.mysql.com/doc/refman/5.6/en/innodb-compression.html Typically we think of compression as making things smaller with a trade-off in speed... but my experience with InnoDB compression is that it can actually perform faster when you're dealing with a server that is blocking on I/O but has spare CPU capacity, because you can trade some of one for some of the other -- the time saved by not having to do disk read/write improves performance more than the extra CPU cycles for compress/decompress harms performance.

At some point, of course, you're always going to max out either CPU or I/O on your server, but running more than one process in parallel will help to some extent, especially if the rows aren't all going into just one table. If they are (I'm not familiar with the data set in question) then you do have a potential problem there, since some of these approaches won't help, but if not, then once you max your machine, you still have the option of marshaling multiple servers to do the work of initially creating the tables, and then copy them over to the final destination when you're done.

With MyISAM tables, the table spaces are intrinsically transportable from server to server. You can create the tables on one or more entirely different servers; run FLUSH TABLES WITH READ LOCK; and then (while keeping the tables locked by not closing your connection) you can actually copy the .frm and .MY* files from one server to the other. When you run SHOW TABLES after copying the files into the directory for the appropriate database, and setting the permissions, they just magically appear and are accessible, because MyISAM does not have an internal data dictionary. Everything is in those files.

In MySQL 5.6, InnoDB tablespaces became transportable also; it's not quite as simple as MyISAM, because the files have to be prepared first: http://dev.mysql.com/doc/refman/5.6/en/tablespace-copying.html ... but still potentially a viable way to get more work done in parallel.

You could also also leverage cloud servers to create and index these tables for you and then copy the tablespace files back to your production system, assuming you have sufficient bandwidth available to actually transfer the files... and you're willing to gamble just a little bit... the spot market price (for idle cloud capacity) for some very serious hardware (multi-core, with high-capacity SSDs, etc.) can be less than 1/10th of the on-demand (normal) price... if you go that route, you have to automate the process to move the completed files off of the cloud server as expeditiously as possible, since the machine can be interrupted at any time if the spot price goes above the amount you're willing to pay.