MySQL – Importing a Wikipedia .sql dump is very slow

Tags: import, MySQL, performance

I'm having trouble importing one of the SQL dumps of Wikipedia into a local database. I downloaded categorylinks.sql from https://dumps.wikimedia.org/enwiki/latest/, which is around 17 GB when unpacked, and tried to load it into a local MySQL database from the command line with mysql -uroot < path/to/file
I monitor the progress by putting pipeviewer in front of the command. The import starts at a reasonable speed (1-2 MB/s), but it gets slower and slower over time; after a weekend of running it is currently sitting at 18 KiB/s, and the ETA of 4 days is rising rather than going down.
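
For reference, the command with pv in front of the mysql client looks roughly like this (the file path and the database name "enwiki" are placeholders):

# pv prints throughput and an ETA while streaming the dump into the mysql client.
pv path/to/categorylinks.sql | mysql -uroot -p enwiki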

The first few INSERT statements run at this speed:

Query OK, 7352 rows affected, 64 warnings (0.48 sec)
Records: 7352  Duplicates: 0  Warnings: 64

Query OK, 7145 rows affected, 64 warnings (0.71 sec)
Records: 7145  Duplicates: 0  Warnings: 64

Query OK, 7139 rows affected, 64 warnings (0.66 sec)
Records: 7139  Duplicates: 0  Warnings: 64

And currently, after 3 days, it has slowed down to:

Query OK, 7166 rows affected, 64 warnings (5 min 25.89 sec)
Records: 7166  Duplicates: 0  Warnings: 64

Query OK, 7013 rows affected, 64 warnings (5 min 8.13 sec)
Records: 7013  Duplicates: 0  Warnings: 64

The warnings are all just "Invalid utf8mb4 character string", so those should not be the problem, right? And the duplicate checks are turned off by setting unique_checks to 0.

I've already tried tweaking many options, most importantly setting autocommit, unique_checks, and foreign_key_checks to 0. I have also set innodb_flush_log_at_trx_commit = 2 and played around with the values of innodb_buffer_pool_size, innodb_log_buffer_size, innodb_log_file_size, and innodb_write_io_threads.
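
Concretely, the session-level switches are fed to the same connection that does the import, roughly like this (the database name "enwiki" and the file path are placeholders; the innodb_* sizes are configured in my.cnf and not shown):

# Prepend the SET statements so they apply to the session doing the inserts,
# then stream the dump through pv and commit once at the end.
{
  echo "SET autocommit=0;"
  echo "SET unique_checks=0;"
  echo "SET foreign_key_checks=0;"
  pv path/to/categorylinks.sql
  echo "COMMIT;"
} | mysql -uroot -p enwiki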

Neither RAM (16 GB) nor I/O write speed appears to be the bottleneck: according to htop and iotop, the process is nowhere near using the available resources.
I've read about other people importing more than 60 GB of data in under 4 hours, so it should be possible to speed this up somehow.

Do you have any recommendations on what other options to change?

Would it be an option to split the SQL file into many smaller files and import them separately? Could that be done in parallel?
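
Something like the following is what I have in mind; this is only an untested sketch, where the chunk size and the database name "enwiki" are placeholders, credentials are assumed to come from ~/.my.cnf, and the table definition from the dump would have to be loaded first:

# Wikimedia dumps keep one extended INSERT per line, so the INSERT lines can be
# pulled out, split into chunks, and loaded by several mysql sessions at once.
grep '^INSERT INTO' categorylinks.sql | split -l 200 - chunk_

# Load up to 4 chunks in parallel; each session disables the same checks.
ls chunk_* | xargs -P 4 -I{} sh -c \
  '{ echo "SET unique_checks=0; SET foreign_key_checks=0;"; cat "$1"; } | mysql enwiki' _ {}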

Thank you very much in advance!

Best Answer

Just in case someone else comes looking for the same type of information: I decided that I only wanted to import a subset of Wikipedia pages (the most popular ones), so I built a couple of tools that build a list of the most popular Wikipedia pages (from the monthly logs) and filter the SQL dump data files before import.
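
The tools themselves are not shown here, but purely as an illustration (not the actual tools), the first step of ranking pages by view count from one of the Wikimedia pageview log dumps could look roughly like this; the filename is hypothetical, and the space-separated project / page-title / view-count line format and the "en" project code for English Wikipedia are assumptions:

# Illustrative sketch: build a list of the most viewed English-Wikipedia pages
# from a (hypothetical) pageview log file of space-separated
# "project page_title view_count bytes" lines.
zcat pageviews-sample.gz \
  | awk '$1 == "en" { print $3, $2 }' \
  | sort -rn \
  | head -n 100000 \
  > popular_pages.txt

A second tool would then keep only the rows of each extended INSERT whose page appears in that list, so that the filtered, much smaller file is what finally gets imported.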