MySQL: How to improve performance of inserting over 1M rows into a table with over 100M indexed rows

bulk-insert, innodb, insert, mysql, performance

I have this MySQL table:

CREATE TABLE `codes` (
  `code` bigint(11) unsigned NOT NULL,
  `allocation` int(11) NOT NULL DEFAULT '0',
  `used` tinyint(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`code`),
  KEY `allocation` (`allocation`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

When it is fully up and running it will hold anywhere between 100 million and 300 million codes, each randomly generated as a number between 1 and 10 trillion.

To fill the table I have this stored procedure:

DELIMITER ;;
CREATE DEFINER=`root`@`localhost` PROCEDURE `generate_codes_v4`(
    IN bf_codes_to_generate BIGINT,
    IN bf_lower_limit BIGINT,
    IN bf_upper_limit BIGINT,
    IN bf_allocation_num INT
)
BEGIN

    SET @Codes = bf_codes_to_generate;
    SET @Lower = bf_lower_limit;
    SET @Upper = bf_upper_limit;
    SET @Allocation = bf_allocation_num;

    SET @qry_rand = 'SELECT ROUND(((@Upper - @Lower -1) * RAND() + @Lower), 0) INTO @Random';
    PREPARE qry_rand_stmt FROM @qry_rand;

    SET @qry_insert = 'INSERT IGNORE INTO `codes` (`code`,`allocation`) VALUES ( @Random, @Allocation )';
    PREPARE qry_insert_stmt FROM @qry_insert;

    START TRANSACTION;

    WHILE @Codes > 0 DO

        EXECUTE qry_rand_stmt;
        EXECUTE qry_insert_stmt;

        SET @Codes = @Codes - ROW_COUNT();

    END WHILE;

    COMMIT;

    DEALLOCATE PREPARE qry_rand_stmt;
    DEALLOCATE PREPARE qry_insert_stmt;

END;;
DELIMITER ;

What this does is pick a random number between the given bounds and insert it into the table. Because of INSERT IGNORE and the primary key on code, duplicates are silently skipped, and ROW_COUNT() only decrements the counter when a row was actually inserted.
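
For illustration, a batch of one million codes for allocation 7 across the full range could be requested like this (the argument values here are purely illustrative):

CALL generate_codes_v4(
    1000000,          -- bf_codes_to_generate
    1,                -- bf_lower_limit
    10000000000000,   -- bf_upper_limit
    7                 -- bf_allocation_num
);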

We currently use this stored procedure to insert anywhere between 500K and 5M rows at a time. While it works, it gets noticeably slower as more rows already exist in the table.

Once there are about 10M rows in the table, the generation process slows to roughly 1,000 rows per second. As we ultimately plan to store 100M to 300M codes in this table, the insertion process will only get slower at that point. In short, this table is not scaling well.

Is there anything that can be done to make this process scale better?

Here are some answers to questions I think you might ask:

Q: Why an index on the allocation column?
A: Each time a batch of rows is inserted we give it an allocation number. We need to be able to quickly pull all rows with a given allocation number back out (example queries below).

Q: Why the use of a transaction?
A: Apparently this stops the index from constantly being flushed to disk while inserting codes, and in our testing it sped inserts up significantly. Also, while not implemented yet, we would like to be able to put a kill switch in place that can cancel a batch insert at any point in time.

Q: Why don't you split the table into multiple tables, for example 1-1T goes into table one, 1T-2T into table two, and so on?
A: We may have to look into doing this, but I'd like to see if what we have now can be improved first.

Q: Is there anything else we should know?
A: This table will constantly be used as a lookup to check whether codes exist and whether they have been used, so it will be SELECT heavy. Any solution must not block reads from this table and should not slow its read performance down too much.
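
For reference, the reads against this table are lookups of roughly this shape (the literal values are just examples):

-- Does this code exist, and has it been used?
SELECT used FROM codes WHERE code = 1234567890;

-- Pull out every code belonging to a given batch (this is what the allocation index is for)
SELECT code, used FROM codes WHERE allocation = 42;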

Best Answer

If you have access to the server file system, I would suggest scripting the number generation (in Perl, PHP, C++, etc.) into a flat file and then performing a LOAD DATA INFILE operation.

Typically LOAD DATA INFILE performs faster than repeated INSERT statements for larger row sets, and it can also handle the IGNORE clause. Have a look at this answer regarding the bulk_insert_buffer_size variable, which is important when doing bulk inserts, should you elect to go with the LOAD DATA INFILE option.
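
As a minimal sketch, assuming your script writes one code,allocation pair per line to /var/lib/mysql-files/codes.csv (the path, delimiter, and file layout here are assumptions, and the file must live somewhere the server is permitted to read, see secure_file_priv), the load would look something like this:

-- IGNORE skips rows whose code already exists, mirroring the INSERT IGNORE behaviour
LOAD DATA INFILE '/var/lib/mysql-files/codes.csv'
IGNORE INTO TABLE codes
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(code, allocation);

The used column is simply left at its default of 0, exactly as the stored procedure does.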