MySQL – Index before or after bulk load using LOAD DATA INFILE

bulk-import, index, myisam, MySQL

I have a database with over 1B rows and two columns that are indexed (in addition to the PK).
Is it better to have the index pre-defined in the table before the load infile or better to index after the data has been loaded?

A couple of notes regarding data size and system:

  • System is Linux w/ 8 cores and 32GB memory (currently maxed out
    unless I move to new HW)
  • DB is 1B rows, roughly 150GB of raw data.
  • Database is MyISAM and is mainly read-only after it's loaded.

Best Answer

I have tried a variety of solutions with a similar data load (over 1B rows), but the best approach I have found is this:

From the MySQL documentation:

With some extra work, it is possible to make LOAD DATA INFILE run even faster for a MyISAM table when the table has many indexes. Use the following procedure:

  1. Execute a FLUSH TABLES statement or a mysqladmin flush-tables command.

  2. Use myisamchk --keys-used=0 -rq /path/to/db/tbl_name to remove all use of indexes for the table.

  3. Insert data into the table with LOAD DATA INFILE. This does not update any indexes and therefore is very fast.

  4. Re-create the indexes with myisamchk -rq /path/to/db/tbl_name. This creates the index tree in memory before writing it to disk, which is much faster than updating the index during LOAD DATA INFILE because it avoids lots of disk seeks. The resulting index tree is also perfectly balanced.

  5. Execute a FLUSH TABLES statement or a mysqladmin flush-tables command.

LOAD DATA INFILE performs the preceding optimization automatically if the MyISAM table into which you insert data is empty. The main difference between automatic optimization and using the procedure explicitly is that you can let myisamchk allocate much more temporary memory for the index creation than you might want the server to allocate for index re-creation when it executes the LOAD DATA INFILE statement.
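Putting the quoted procedure together, a minimal shell sketch could look like the following (the data directory /var/lib/mysql, database name db, table name tbl_name, the CSV format, and the omitted authentication options are all assumptions to adapt to your setup):

# 1. Close and flush the table files on disk
$ mysqladmin flush-tables

# 2. Disable use of all indexes for the table
$ myisamchk --keys-used=0 -rq /var/lib/mysql/db/tbl_name

# 3. Load the data; no indexes are updated, so this is fast
$ mysql db -e "LOAD DATA INFILE '/path/to/data.csv' INTO TABLE tbl_name FIELDS TERMINATED BY ','"

# 4. Rebuild the indexes by sorting the index tree in memory
$ myisamchk -rq /var/lib/mysql/db/tbl_name

# 5. Flush again so the server reopens the rebuilt files
$ mysqladmin flush-tables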

To get better performance from myisamchk, you have to tune some parameters, depending on your hardware:

--key_buffer_size --myisam_sort_buffer_size --read_buffer_size --write_buffer_size
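For example (the buffer sizes below are illustrative only; size them to your hardware, e.g. the 32GB of RAM mentioned above):

$ myisamchk --key_buffer_size=512M --myisam_sort_buffer_size=512M \
    --read_buffer_size=4M --write_buffer_size=4M \
    -rq /path/to/db/tbl_name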

Note

When using LOCAL with LOAD DATA, a copy of the file is created in the server's temporary directory. This is not the directory determined by the value of tmpdir or slave_load_tmpdir, but rather the operating system's temporary directory, and is not configurable in the MySQL Server.

So, if you have this kind of problem and your file is a CSV, you can split your "huge" file into chunks:

# Set ROWS, FILE_SIZE and TMP_SIZE first (the last two in the same units);
# lines per chunk = rows in file / ((file size / tmp dir free space) + 1)
$ split -l $(( ROWS / (FILE_SIZE / TMP_SIZE + 1) )) /path/to/your/<file>.csv

Then repeat your LOAD DATA LOCAL (step 3) for every chunk file.
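A minimal loop for that might look like this, assuming split's default chunk names (xaa, xab, ...) and again omitting authentication options:

# --local-infile=1 enables LOAD DATA LOCAL on the client side
$ for f in x??; do mysql --local-infile=1 db -e "LOAD DATA LOCAL INFILE '$f' INTO TABLE tbl_name FIELDS TERMINATED BY ','"; done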