MySQL – Large INSERT INTO SELECT [..] FROM gradually gets slower

insert, MySQL, performance, query-performance

I've written a program that does a large INSERT in batches of 100,000 and shows its progress.

The source table contains 2.5GB of data:

CREATE TABLE wikt.text (
    old_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    old_text MEDIUMBLOB NOT NULL,
    old_flags TINYBLOB NOT NULL,

    PRIMARY KEY (old_id),
    KEY old_id (old_id)

)   ENGINE=INNODB
    AUTO_INCREMENT=23565544
    DEFAULT CHARSET=binary;

CREATE INDEX old_id ON text (old_id);

This is the destination table:

CREATE TABLE domains.dictionary_language (
    text_id     INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    english     TINYINT(1) UNSIGNED NOT NULL,

    PRIMARY KEY (text_id),
    KEY         english (english)

)   ENGINE=INNODB
    AUTO_INCREMENT=23565544;
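
For context, the english flag records whether the wikitext contains an ==English== section, and the KEY english (english) index is there to support filtering on that flag later. A hypothetical follow-up query (not part of the import itself) would look like:

-- hypothetical usage: count pages flagged as containing an ==English== section
SELECT COUNT(*)
FROM   domains.dictionary_language
WHERE  english = 1;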

This is the query being run in batches of 100k:

INSERT INTO domains.dictionary_language
    SELECT      old_id,
                IF(old_text LIKE '%==English==%', 1, 0)

    FROM        wikt.text

    LIMIT       {batch}, 100000;
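
For illustration, this is what the statement looks like once the {batch} placeholder has been filled in; the concrete offsets below are just examples:

-- Batch 1: offset 0
INSERT INTO domains.dictionary_language
    SELECT old_id, IF(old_text LIKE '%==English==%', 1, 0)
    FROM   wikt.text
    LIMIT  0, 100000;

-- Batch 24: offset 2,300,000
INSERT INTO domains.dictionary_language
    SELECT old_id, IF(old_text LIKE '%==English==%', 1, 0)
    FROM   wikt.text
    LIMIT  2300000, 100000;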

The query is getting slower and slower. The first 1 million records were inserted in 8 minutes. After that, records #1.2m–#1.3m alone took 7 minutes, and #2.3m–#2.4m have just finished in 15 minutes.

Here's the log of how long each batch of 100k took to import.

As you can see, the first 12 batches (1.2 million records) each inserted in under a minute. After that the performance drops off, with each batch taking a bit longer than the last!

24/11/2013 19:18:40 Ready

24/11/2013 19:18:42 Dictionary import started from Wiktionary
24/11/2013 19:18:42 Records:    3,729,613
24/11/2013 19:18:42 Batches of: 100,000

24/11/2013 19:19:11 Batch 1 finished in 00:00:29.3146767
24/11/2013 19:19:33 Batch 2 finished in 00:00:22.2142706
24/11/2013 19:19:41 Batch 3 finished in 00:00:07.6104353
24/11/2013 19:19:53 Batch 4 finished in 00:00:12.7057267
24/11/2013 19:20:08 Batch 5 finished in 00:00:14.9248537
24/11/2013 19:20:25 Batch 6 finished in 00:00:16.9849715
24/11/2013 19:20:43 Batch 7 finished in 00:00:17.7930177
24/11/2013 19:20:49 Batch 8 finished in 00:00:06.2453572
24/11/2013 19:21:07 Batch 9 finished in 00:00:17.2549869
24/11/2013 19:21:38 Batch 10 finished in 00:00:31.4577993
24/11/2013 19:22:02 Batch 11 finished in 00:00:23.7003556
24/11/2013 19:22:17 Batch 12 finished in 00:00:15.4078813
24/11/2013 19:23:40 Batch 13 finished in 00:01:22.9637452
24/11/2013 19:25:25 Batch 14 finished in 00:01:44.8639979
24/11/2013 19:27:40 Batch 15 finished in 00:02:15.1387295
24/11/2013 19:30:07 Batch 16 finished in 00:02:26.7553939
24/11/2013 19:33:01 Batch 17 finished in 00:02:54.3109701
24/11/2013 19:36:17 Batch 18 finished in 00:03:15.8252006
24/11/2013 19:39:57 Batch 19 finished in 00:03:40.1275906
24/11/2013 19:44:28 Batch 20 finished in 00:04:30.3824650
24/11/2013 19:49:48 Batch 21 finished in 00:05:20.6873423
24/11/2013 19:55:45 Batch 22 finished in 00:05:56.7674059
24/11/2013 20:02:37 Batch 23 finished in 00:06:52.0925703
24/11/2013 20:10:32 Batch 24 finished in 00:07:54.8921622
24/11/2013 20:18:12 Batch 25 finished in 00:07:39.9433072
24/11/2013 20:26:34 Batch 26 finished in 00:08:21.4696824
24/11/2013 20:36:00 Batch 27 finished in 00:09:26.3163915
24/11/2013 20:45:07 Batch 28 finished in 00:09:07.1472950
24/11/2013 20:54:48 Batch 29 finished in 00:09:41.0222326
24/11/2013 21:04:19 Batch 30 finished in 00:09:31.2316726
24/11/2013 21:14:35 Batch 31 finished in 00:10:15.3521962
24/11/2013 21:25:10 Batch 32 finished in 00:10:34.9583176
24/11/2013 21:36:27 Batch 33 finished in 00:11:17.6047568
24/11/2013 21:47:52 Batch 34 finished in 00:11:24.3261412
24/11/2013 21:59:32 Batch 35 finished in 00:11:40.2410515
24/11/2013 22:12:31 Batch 36 finished in 00:12:59.1605654
24/11/2013 22:26:12 Batch 37 finished in 00:13:40.9209540
24/11/2013 22:40:12 Batch 38 finished in 00:13:59.8160347

24/11/2013 22:40:12 Dictionary import finished

It's getting slower and slower with each batch of 100k! Why?

The server itself is tuned, with a 4 GB InnoDB buffer pool among other settings.
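
For reference, the configured pool size can be confirmed with a standard variable lookup:

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- expected value here: 4294967296 (4 GB)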

Best Answer

The large offsets can have this effect: with LIMIT {batch}, 100000, MySQL still has to read and discard all of the skipped rows before it can return the batch, so every batch re-scans everything the previous batches already covered. I would remove the offset entirely and keep only LIMIT 100000, resuming each batch from the highest text_id already copied:

INSERT INTO domains.dictionary_language
  (text_id, english)
    SELECT      t.old_id,
                IF(t.old_text LIKE '%==English==%', 1, 0)

    FROM        wikt.text AS t
      JOIN      ( SELECT COALESCE(MAX(text_id), 0) AS offset
                  FROM domains.dictionary_language
                ) AS m
                ON  t.old_id > m.offset

    ORDER BY    t.old_id
    LIMIT       100000;
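
Each run of this statement resumes where the previous one stopped: the derived table m looks up the highest text_id already copied (0 on the first run), and the SELECT then reads only the next 100,000 rows in old_id order through the primary key, so the work per batch stays roughly constant instead of growing with the offset. The same idea can be driven from the application by remembering the last id copied; a sketch, with ? standing for that remembered value:

INSERT INTO domains.dictionary_language
  (text_id, english)
    SELECT      old_id,
                IF(old_text LIKE '%==English==%', 1, 0)
    FROM        wikt.text
    WHERE       old_id > ?     -- last old_id inserted by the previous batch, tracked by the program
    ORDER BY    old_id
    LIMIT       100000;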