MySQL – Load big file in MySQL table

Tags: mac-os-x, mysql, mysql-5.7

I've read some posts about similar issues, but without success, so I'm posting my specific problem.

I have to load a big CSV file (25 GB) into a MySQL table. I'm working in the following environment:

  • MacOS Mojave 10.14.3.

  • 16 GB of DDR3 RAM.

  • 75 GB of free space on my hard disk.

  • MySQL 5.7 for MacOS 10.14 with default configuration.

At first I tried with this:

CREATE DATABASE mydatabase CHARACTER SET utf8;
CREATE TABLE mytable (field1 TEXT, field2 INTEGER, field3 VARCHAR(50));
LOAD DATA INFILE 'hugefile.csv' IGNORE INTO TABLE mytable CHARACTER SET utf8;
CREATE INDEX index_3 ON mytable (field3);
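
Note: with no FIELDS or LINES clauses, LOAD DATA INFILE expects tab-separated input, so for a genuinely comma-separated file the delimiters would normally be spelled out. A minimal sketch, assuming comma-separated fields optionally enclosed in double quotes:

LOAD DATA INFILE 'hugefile.csv' IGNORE INTO TABLE mytable
  CHARACTER SET utf8
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n';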

But my computer ran out of memory and free disk space, and the process crashed after more than 30 hours!

I also tried the reverse order, creating the index before the load, with the same result:

CREATE DATABASE mydatabase CHARACTER SET utf8;
CREATE TABLE mytable (field1 TEXT, field2 INTEGER, field3 VARCHAR(50));
CREATE INDEX index_3 ON mytable (field3);
LOAD DATA INFILE 'hugefile.csv' IGNORE INTO TABLE mytable CHARACTER SET utf8;

Finally, after some investigation, I tried this MySQL configuration file:

[mysqld]

# Remove leading # and set to the amount of RAM for the most important data
# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.
# innodb_buffer_pool_size = 128M

# Remove leading # to turn on a very important data integrity option: logging
# changes to the binary log between backups.
# log_bin

# These are commonly set, remove the # and set as required.
# basedir = .....
# datadir = .....
# port = .....
# server_id = .....
# socket = .....
secure_file_priv="/usr/local/mysql/data/"
innodb_buffer_pool_size = 6G
innodb_log_file_size = 4G
innodb_flush_log_at_trx_commit = 1
innodb_flush_method = O_DIRECT
innodb_io_capacity_max = 2000
innodb_io_capacity = 1000
# Remove leading # to set options mainly useful for reporting servers.
# The server defaults are faster for transactions and fast SELECTs.
# Adjust sizes as needed, experiment to find the optimal values.
# join_buffer_size = 128M
# sort_buffer_size = 2M
# read_rnd_buffer_size = 2M 

sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES 
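
To double-check that the server actually picks these values up after a restart, the standard SHOW VARIABLES inspection can be used (nothing here is specific to my setup):

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'innodb_log_file_size';
SHOW VARIABLES LIKE 'key_buffer_size';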

And this code (using MyISAM engine):

CREATE DATABASE mydatabase CHARACTER SET utf8;
CREATE TABLE mytable (field1 TEXT, field2 INTEGER, field3 VARCHAR(50)) ENGINE = MYISAM;
CREATE INDEX index_3 ON mytable (field3);
LOAD DATA INFILE 'hugefile.csv' IGNORE INTO TABLE mytable CHARACTER SET utf8;

This performs much better: it doesn't crash, but it still takes 4 hours to complete.
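
A related MyISAM bulk-load pattern from the MySQL manual (sketched here with the same names, though I have not timed it) is to disable the non-unique indexes during the load and rebuild them in a single pass afterwards:

ALTER TABLE mytable DISABLE KEYS;
LOAD DATA INFILE 'hugefile.csv' IGNORE INTO TABLE mytable CHARACTER SET utf8;
ALTER TABLE mytable ENABLE KEYS;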

The issue is that I asked a coworker to run the exact same process, in the same environment and with the same configuration (the only difference is that he has 150 GB of free disk space), and for him it takes only 1 hour!

So, my questions are:

  • What could be the difference between his Mac and mine that causes such different performance? What could I check?
  • Is there anything I could change in the MySQL configuration to improve the performance of this process?
  • Is it normal for the process to require all of my free hard disk space? While it runs, I monitor the free space, and it keeps decreasing from 75 GB down to only about 500 MB, at which point I start getting memory warnings.

EDIT to add answers to Rick's questions:

  • Both of us have SSDs.

  • I haven't set key_buffer_size, so I guess it has its default value (8388608 bytes, i.e. 8 MB, according to the MySQL 5.7 documentation).

  • That TEXT column has all of its values between 5 and 12 characters.

  • Things go sour during the CREATE INDEX: with LOAD + CREATE INDEX, the LOAD command takes approximately 10 minutes and the CREATE INDEX takes hours; with CREATE INDEX + LOAD, the CREATE INDEX is instantaneous (the table is still empty) and the LOAD takes hours (normal).

Best Answer

The data in a CSV file is encoded one way; the table in MySQL is encoded another way. The difference could be 2x either way. That is, the 25 GB CSV file might, for example, expand to 50 GB in the table, which then has trouble adding the indexes on top of that.
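
One way to see how much the table and its indexes actually occupy on disk is the standard information_schema query:

SELECT table_name,
       ROUND(data_length  / 1024 / 1024 / 1024, 1) AS data_gb,
       ROUND(index_length / 1024 / 1024 / 1024, 1) AS index_gb
FROM information_schema.tables
WHERE table_schema = 'mydatabase' AND table_name = 'mytable';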

When MySQL runs out of disk space, it waits for you to free some space. This might explain 40 hours. Or it dies.

Does one of you have SSDs, the other have HDDs?

What was the value of key_buffer_size when using MyISAM? Keep in mind that innodb_buffer_pool_size and key_buffer_size are competing for RAM, so you should be careful when switching between engines.

InnoDB wants a PRIMARY KEY; you don't have one.
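
A minimal sketch of adding one (the id column is an invented surrogate, not something in your schema):

CREATE TABLE mytable (
    id     BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,  -- invented surrogate key
    field1 TEXT,
    field2 INTEGER,
    field3 VARCHAR(50)
) ENGINE = InnoDB;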

How big is the TEXT column typically? InnoDB stores them off-record in some situations.

When did things go sour? During the LOAD? Or during the CREATE INDEX?

When adding an index to a MyISAM table, it will either "repair via key buffer" or "repair by copying" (or something like that). The clue is visible in SHOW PROCESSLIST during the CREATE INDEX. If it is by "key_buffer", then 8M is much too small. 1G may be suitable for your machine. If it is the other one, then I am surprised that it took so long.
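
For example, to try a bigger key buffer before re-running the CREATE INDEX (a sketch; 1G is the ballpark suggested above, not a measured optimum):

SET GLOBAL key_buffer_size = 1024 * 1024 * 1024;  -- 1G; the key cache is a shared global
SHOW PROCESSLIST;  -- run from another session during the CREATE INDEX and look for
                   -- "Repair by sorting" vs "Repair with keycache"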

Another trick (depending on the indexes you have or will have) is to first sort the data by the main key (field3? in your case) before doing the LOAD DATA.
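
On macOS or Linux that pre-sort can be done with the stock sort utility. A rough sketch, assuming a comma-separated file whose third column is field3, with no embedded commas or quoted newlines:

sort -t, -k3,3 hugefile.csv > hugefile_sorted.csv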

If field1 is always 12 characters or fewer, then declare it VARCHAR(12). (If it might get a bit bigger, pick some other limit.) TEXT has some drawbacks compared to VARCHAR. I am not sure whether it is relevant for this case, but it should not hurt.
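
For the record, that change can be made in place with standard DDL (using the 12-character limit mentioned above):

ALTER TABLE mytable MODIFY field1 VARCHAR(12);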