MySQL: How to improve performance of inserting over 1M rows into a table with over 100M indexed rows

bulk-insert, innodb, insert, mysql, performance

I have this MySQL table:

CREATE TABLE `codes` (
  `code` bigint(11) unsigned NOT NULL,
  `allocation` int(11) NOT NULL DEFAULT '0',
  `used` tinyint(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`code`),
  KEY `allocation` (`allocation`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

When it is fully up and running it will hold anywhere between 100 million and 300 million codes, each randomly generated as a number between 1 and 10 trillion.

To fill the table I have this stored procedure:

DELIMITER ;;
CREATE DEFINER=`root`@`localhost` PROCEDURE `generate_codes_v4`(
    IN bf_codes_to_generate BIGINT,
    IN bf_lower_limit BIGINT,
    IN bf_upper_limit BIGINT,
    IN bf_allocation_num INT
)
BEGIN

    SET @Codes = bf_codes_to_generate;
    SET @Lower = bf_lower_limit;
    SET @Upper = bf_upper_limit;
    SET @Allocation = bf_allocation_num;

    SET @qry_rand = 'SELECT ROUND(((@Upper - @Lower -1) * RAND() + @Lower), 0) INTO @Random';
    PREPARE qry_rand_stmt FROM @qry_rand;

    SET @qry_insert = 'INSERT IGNORE INTO `codes` (`code`,`allocation`) VALUES ( @Random, @Allocation )';
    PREPARE qry_insert_stmt FROM @qry_insert;

    START TRANSACTION;

    WHILE @Codes > 0 DO

        EXECUTE qry_rand_stmt;
        EXECUTE qry_insert_stmt;

        SET @Codes = @Codes - ROW_COUNT();

    END WHILE;

    COMMIT;

    DEALLOCATE PREPARE qry_rand_stmt;
    DEALLOCATE PREPARE qry_insert_stmt;

END;;
DELIMITER ;

What this does is pick a random number between the given bounds and insert it into the table. Because of INSERT IGNORE and the primary key on code, duplicates are silently skipped, and ROW_COUNT() only decrements the counter when a row was actually inserted.
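
For illustration, a batch of one million codes for allocation 7 across the full range could be requested like this (the argument values here are purely illustrative):

CALL generate_codes_v4(
    1000000,          -- bf_codes_to_generate
    1,                -- bf_lower_limit
    10000000000000,   -- bf_upper_limit
    7                 -- bf_allocation_num
);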

We currently use this stored procedure to insert anywhere between 500K and 5M rows at a time. While it works, it gets noticeably slower as more rows already exist in the table.

Once there are about 10M rows in the table, the generation process slows to roughly 1,000 rows per second. As we ultimately plan to store 100M to 300M codes in this table, the insertion process will only get slower at that point. In short, this table is not scaling well.

Is there anything that can be done to make this process scale better?

Here are some answers to questions I think you might ask:

Q: Why an index on the allocation column?
A: Each time a batch of rows is inserted we give it an allocation number. We need to be able to quickly pull all rows with a given allocation number back out (example queries below).

Q: Why the use of a transaction?
A: Apparently this stops the index from constantly being flushed to disk while inserting codes, and in our testing it sped inserts up significantly. Also, while not implemented yet, we would like to be able to put a kill switch in place that can cancel a batch insert at any point in time.

Q: Why don't you split the table into multiple tables, for example 1-1T goes into table one, 1T-2T into table two, and so on?
A: We may have to look into doing this, but I'd like to see if what we have now can be improved first.

Q: Is there anything else we should know?
A: This table will constantly be used as a lookup to check whether codes exist and whether they have been used, so it will be SELECT heavy. Any solution must not block reads from this table and should not slow its read performance down too much.
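
For reference, the reads against this table are lookups of roughly this shape (the literal values are just examples):

-- Does this code exist, and has it been used?
SELECT used FROM codes WHERE code = 1234567890;

-- Pull out every code belonging to a given batch (this is what the allocation index is for)
SELECT code, used FROM codes WHERE allocation = 42;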

Best Answer

If you have access to the server file system, I would suggest scripting the number generation (in Perl, PHP, C++, etc.) into a flat file and then performing a LOAD DATA INFILE operation.

Typically LOAD DATA INFILE performs faster than repeated INSERT statements for larger row sets, and it can also handle the IGNORE clause. Have a look at this answer regarding the bulk_insert_buffer_size variable, which is important when doing bulk inserts, should you elect to go with the LOAD DATA INFILE option.
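
As a minimal sketch, assuming your script writes one code,allocation pair per line to /var/lib/mysql-files/codes.csv (the path, delimiter, and file layout here are assumptions, and the file must live somewhere the server is permitted to read, see secure_file_priv), the load would look something like this:

-- IGNORE skips rows whose code already exists, mirroring the INSERT IGNORE behaviour
LOAD DATA INFILE '/var/lib/mysql-files/codes.csv'
IGNORE INTO TABLE codes
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(code, allocation);

The used column is simply left at its default of 0, exactly as the stored procedure does.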