MySQL – Index before or after bulk load using LOAD DATA INFILE

bulk-import, index, myisam, MySQL

I have a database with over 1B rows and two columns that are indexed (in addition to the PK).
Is it better to have the index pre-defined in the table before the load infile or better to index after the data has been loaded?

A couple of notes regarding data size and system:

  • System is Linux w/ 8 cores and 32GB memory (currently maxed out
    unless I move to new HW)
  • DB is 1B rows, roughly 150GB of raw data.
  • Database is MyISAM and is mainly read-only after it's loaded.

Best Answer

I have tried a variety of solutions with a similar data load (over 1B rows), but the best approach I have found is this:

From the MySQL documentation:

With some extra work, it is possible to make LOAD DATA INFILE run even faster for a MyISAM table when the table has many indexes. Use the following procedure:

  1. Execute a FLUSH TABLES statement or a mysqladmin flush-tables command.

  2. Use myisamchk --keys-used=0 -rq /path/to/db/tbl_name to remove all use of indexes for the table.

  3. Insert data into the table with LOAD DATA INFILE. This does not update any indexes and therefore is very fast.

  4. Re-create the indexes with myisamchk -rq /path/to/db/tbl_name. This creates the index tree in memory before writing it to disk, which is much faster than updating the index during LOAD DATA INFILE because it avoids lots of disk seeks. The resulting index tree is also perfectly balanced.

  5. Execute a FLUSH TABLES statement or a mysqladmin flush-tables command.

LOAD DATA INFILE performs the preceding optimization automatically if the MyISAM table into which you insert data is empty. The main difference between automatic optimization and using the procedure explicitly is that you can let myisamchk allocate much more temporary memory for the index creation than you might want the server to allocate for index re-creation when it executes the LOAD DATA INFILE statement.
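Putting the quoted procedure together, a minimal shell sketch could look like the following (the data directory /var/lib/mysql, database name db, table name tbl_name, the CSV format, and the omitted authentication options are all assumptions to adapt to your setup):

# 1. Close and flush the table files on disk
$ mysqladmin flush-tables

# 2. Disable use of all indexes for the table
$ myisamchk --keys-used=0 -rq /var/lib/mysql/db/tbl_name

# 3. Load the data; no indexes are updated, so this is fast
$ mysql db -e "LOAD DATA INFILE '/path/to/data.csv' INTO TABLE tbl_name FIELDS TERMINATED BY ','"

# 4. Rebuild the indexes by sorting the index tree in memory
$ myisamchk -rq /var/lib/mysql/db/tbl_name

# 5. Flush again so the server reopens the rebuilt files
$ mysqladmin flush-tables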

To get better performance from myisamchk, you have to tune some parameters, depending on your hardware:

--key_buffer_size --myisam_sort_buffer_size --read_buffer_size --write_buffer_size
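For example (the buffer sizes below are illustrative only; size them to your hardware, e.g. the 32GB of RAM mentioned above):

$ myisamchk --key_buffer_size=512M --myisam_sort_buffer_size=512M \
    --read_buffer_size=4M --write_buffer_size=4M \
    -rq /path/to/db/tbl_name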

Note

When using LOCAL with LOAD DATA, a copy of the file is created in the server's temporary directory. This is not the directory determined by the value of tmpdir or slave_load_tmpdir, but rather the operating system's temporary directory, and is not configurable in the MySQL Server.

So, if you have this kind of problem and your file is a CSV, you can split your "huge" file into chunks:

# Set ROWS, FILE_SIZE and TMP_SIZE first (the last two in the same units);
# lines per chunk = rows in file / ((file size / tmp dir free space) + 1)
$ split -l $(( ROWS / (FILE_SIZE / TMP_SIZE + 1) )) /path/to/your/<file>.csv

Then repeat your LOAD DATA LOCAL (step 3) for every chunk file.
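A minimal loop for that might look like this, assuming split's default chunk names (xaa, xab, ...) and again omitting authentication options:

# --local-infile=1 enables LOAD DATA LOCAL on the client side
$ for f in x??; do mysql --local-infile=1 db -e "LOAD DATA LOCAL INFILE '$f' INTO TABLE tbl_name FIELDS TERMINATED BY ','"; done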