A laboratory I work in is attempting to clone the open biochemical database managed by PubChem. (Please don't ask why; I am not in charge of this project, I am simply tasked with cloning the database.) Their database contains tens of millions of records, and the script I currently have running uses MySQL's "LOAD DATA INFILE" statement to write the 720 thousand CSV files we have of their database into our MySQL database… one by one. At the current rate, the sheer amount of data being saved will make this script take roughly 240 days to finish, and we don't have that much time. The questions I have are:
- Is there a generally accepted way to quickly load a vast number of CSV files into MySQL (over 650 GB of data)?
- If not, is there a database system better suited to this sort of task, perhaps one designed specifically for big data? (I'm no database expert, so forgive me if this question makes no sense.)
Best Answer
Simple approaches:
At some point, of course, you're going to max out either CPU or I/O on your server, but running more than one load process in parallel will help up to that point, especially if the rows aren't all going into just one table. If they are (I'm not familiar with the data set in question), some of these approaches won't help. If they're not, then once you max out one machine you still have the option of marshaling multiple servers to do the initial work of creating the tables, and then copying them over to the final destination when you're done.
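The parallel-load idea can be sketched roughly as follows. The table name ("compounds"), file paths, and delimiter options are hypothetical, and the actual statement execution is stubbed out, so this is a shape, not a drop-in script:

```python
# Sketch: fan the CSV loads out across several worker processes instead
# of running them one by one. Table name and paths are placeholders.
from multiprocessing import Pool

def build_load_statement(csv_path, table="compounds"):
    # LOAD DATA INFILE is the real MySQL statement; "compounds" and the
    # FIELDS/LINES options are assumptions about the CSV layout.
    return (
        f"LOAD DATA LOCAL INFILE '{csv_path}' "
        f"INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n';"
    )

def load_one(csv_path):
    stmt = build_load_statement(csv_path)
    # In a real run you would execute stmt over a MySQL connection here
    # (e.g. a connector library, or shelling out to the mysql client);
    # returning it keeps the sketch self-contained.
    return stmt

if __name__ == "__main__":
    files = [f"/data/pubchem/part{i:06d}.csv" for i in range(8)]
    with Pool(processes=4) as pool:  # four concurrent load streams
        results = pool.map(load_one, files)
    print(len(results))
```

The worker count is the knob: raise it until the server's CPU or disk, not the single-threaded load loop, is the bottleneck.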
With MyISAM tables, the table spaces are intrinsically transportable from server to server. You can create the tables on one or more entirely different servers; run
FLUSH TABLES WITH READ LOCK;
and then, while keeping the tables locked by not closing your connection, you can actually copy the .frm and .MY* files from one server to the other. When you run

SHOW TABLES;

after copying the files into the directory for the appropriate database and setting the permissions, the tables just magically appear and are accessible, because MyISAM has no internal data dictionary: everything is in those files.

In MySQL 5.6, InnoDB tablespaces became transportable as well. It's not quite as simple as with MyISAM, because the files have to be prepared first: http://dev.mysql.com/doc/refman/5.6/en/tablespace-copying.html ... but it is still a potentially viable way to get more work done in parallel.
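A minimal sketch of the MyISAM file-copy step. The directories here are temporary stand-ins for a real MySQL datadir (and the table name is made up) so the sketch runs anywhere; on real servers the copy would be scp/rsync between hosts, done while a separate session holds the read lock:

```shell
#!/bin/sh
# Sketch of moving a MyISAM table by copying its files. On the source
# server a separate session must first run, and keep open:
#   FLUSH TABLES WITH READ LOCK;
set -eu

# Simulated source and destination datadirs (placeholder db "pubchem").
SRC=$(mktemp -d)/pubchem
DST=$(mktemp -d)/pubchem
mkdir -p "$SRC" "$DST"

# Stand-ins for the three files MyISAM keeps per table.
touch "$SRC/compounds.frm" "$SRC/compounds.MYD" "$SRC/compounds.MYI"

# Copy the table's files into the destination database directory.
cp "$SRC"/compounds.frm "$SRC"/compounds.MY* "$DST"/

# On the real destination: fix ownership (e.g. chown mysql:mysql),
# then SHOW TABLES; the table appears with no import step.
ls "$DST"
```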
You could also leverage cloud servers to create and index these tables for you, and then copy the tablespace files back to your production system, assuming you have sufficient bandwidth available to actually transfer the files... and you're willing to gamble just a little bit: the spot-market price (for idle cloud capacity) for some very serious hardware (multi-core, with high-capacity SSDs, etc.) can be less than 1/10th of the on-demand (normal) price. If you go that route, you have to automate the process of moving the completed files off the cloud server as expeditiously as possible, since the machine can be interrupted at any time if the spot price goes above the amount you're willing to pay.
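That drain step might look something like the following. The directories are simulated so the sketch is self-contained; in production the mv would be an rsync to the destination host, run from cron or a loop:

```shell
#!/bin/sh
# Sketch: move finished table files off a spot instance as soon as they
# appear, so little work is lost if the instance is reclaimed.
set -eu

DONE=$(mktemp -d)   # where completed tablespace files land
SAFE=$(mktemp -d)   # stand-in for the production destination

# Simulate two finished tables appearing (names are placeholders).
touch "$DONE/compounds.MYD" "$DONE/substances.MYD"

# Drain everything that is complete. In production, replace mv with
# something like: rsync --remove-source-files "$DONE"/ prod:/staging/
for f in "$DONE"/*; do
    mv "$f" "$SAFE"/
done

ls "$SAFE"
```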