MySQL – Importing a Wikipedia .sql dump is very slow

Tags: import, MySQL, performance

I'm having trouble importing one of the SQL dumps of Wikipedia into a local database. I downloaded categorylinks.sql from https://dumps.wikimedia.org/enwiki/latest/, which is around 17 GB when unpacked, and tried to load it into a local MySQL database from the command line with mysql -uroot < path/to/file
I monitor the progress by putting pipeviewer in front of the command. The import starts at a reasonable speed (1-2 MB/s), but it gets slower and slower over time; after a weekend of running it is currently sitting at 18 KiB/s, and the ETA of 4 days is rising rather than going down.
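
For reference, the command with pv in front of the mysql client looks roughly like this (the file path and the database name "enwiki" are placeholders):

# pv prints throughput and an ETA while streaming the dump into the mysql client.
pv path/to/categorylinks.sql | mysql -uroot -p enwiki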

The first few INSERT statements run at this speed:

Query OK, 7352 rows affected, 64 warnings (0.48 sec)
Records: 7352  Duplicates: 0  Warnings: 64

Query OK, 7145 rows affected, 64 warnings (0.71 sec)
Records: 7145  Duplicates: 0  Warnings: 64

Query OK, 7139 rows affected, 64 warnings (0.66 sec)
Records: 7139  Duplicates: 0  Warnings: 64

And currently, after 3 days, it has slowed down to:

Query OK, 7166 rows affected, 64 warnings (5 min 25.89 sec)
Records: 7166  Duplicates: 0  Warnings: 64

Query OK, 7013 rows affected, 64 warnings (5 min 8.13 sec)
Records: 7013  Duplicates: 0  Warnings: 64

The warnings are all just "Invalid utf8mb4 character string", so those should not be the problem, right? And the duplicate checks are turned off by setting unique_checks to 0.

I've already tried tweaking many options, most importantly setting autocommit, unique_checks, and foreign_key_checks to 0. I have also set innodb_flush_log_at_trx_commit = 2 and played around with the values of innodb_buffer_pool_size, innodb_log_buffer_size, innodb_log_file_size, and innodb_write_io_threads.
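
Concretely, the session-level switches are fed to the same connection that does the import, roughly like this (the database name "enwiki" and the file path are placeholders; the innodb_* sizes are configured in my.cnf and not shown):

# Prepend the SET statements so they apply to the session doing the inserts,
# then stream the dump through pv and commit once at the end.
{
  echo "SET autocommit=0;"
  echo "SET unique_checks=0;"
  echo "SET foreign_key_checks=0;"
  pv path/to/categorylinks.sql
  echo "COMMIT;"
} | mysql -uroot -p enwiki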

Neither RAM (16 GB) nor I/O write speed appears to be the bottleneck: according to htop and iotop, the process is nowhere near using the available resources.
I've read about other people importing more than 60 GB of data in under 4 hours, so it should be possible to speed this up somehow.

Do you have any recommendations on what other options to change?

Would it be an option to split the SQL file into many smaller files and import them separately? Could that be done in parallel?
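
Something like the following is what I have in mind; this is only an untested sketch, where the chunk size and the database name "enwiki" are placeholders, credentials are assumed to come from ~/.my.cnf, and the table definition from the dump would have to be loaded first:

# Wikimedia dumps keep one extended INSERT per line, so the INSERT lines can be
# pulled out, split into chunks, and loaded by several mysql sessions at once.
grep '^INSERT INTO' categorylinks.sql | split -l 200 - chunk_

# Load up to 4 chunks in parallel; each session disables the same checks.
ls chunk_* | xargs -P 4 -I{} sh -c \
  '{ echo "SET unique_checks=0; SET foreign_key_checks=0;"; cat "$1"; } | mysql enwiki' _ {}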

Thank you very much in advance!

Best Answer

Just in case someone else comes looking for the same type of information: I decided that I only wanted to import a subset of Wikipedia pages (the most popular ones), so I built a couple of tools that build a list of the most popular Wikipedia pages (from the monthly logs) and filter the SQL dump data files before import.
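
The tools themselves are not shown here, but purely as an illustration (not the actual tools), the first step of ranking pages by view count from one of the Wikimedia pageview log dumps could look roughly like this; the filename is hypothetical, and the space-separated project / page-title / view-count line format and the "en" project code for English Wikipedia are assumptions:

# Illustrative sketch: build a list of the most viewed English-Wikipedia pages
# from a (hypothetical) pageview log file of space-separated
# "project page_title view_count bytes" lines.
zcat pageviews-sample.gz \
  | awk '$1 == "en" { print $3, $2 }' \
  | sort -rn \
  | head -n 100000 \
  > popular_pages.txt

A second tool would then keep only the rows of each extended INSERT whose page appears in that list, so that the filtered, much smaller file is what finally gets imported.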