MySQL – Is it realistic to expect a database of size over 50 GB (some 300+ million records) to be imported within 3-4 hours on a single server (MySQL)

database-design, MySQL, scalability

I have hundreds of millions of rows in a text/CSV file (a genomics database, by the way – each record is less than 255 characters long…).

Ideally I'd like to make them searchable, since right now my best bet is splitting them (with a little help from Cygwin!) and reading them one by one as ~500 MB text files in Notepad++ (yes… I know…) – a very inconvenient, caveman-like approach.

I'd like to use MySQL, but I'm open to other options. I have a budget of up to $500 for Amazon instances when needed – maybe 32 GB of RAM, a Xeon Gold, and a 200 GB disk on Amazon can do it? It's no problem to use up to 10 instances, each doing concurrent inserts/loading.

I read that someone had achieved 300,000 rows/second using 'load data infile' on a local server with an SSD and 32 GB of RAM – if I can get even 50,000 rows/second and then query it with, say, phpMyAdmin in reasonable time, I'd be happy. Thanks!

Best Answer

I read that someone had achieved 300,000 rows/second using 'load data infile' on a local server with an SSD and 32 GB of RAM

That sounds like it was taken from my blog posts (or at least those were my numbers and specs): https://jynus.com/dbahire/testing-the-fastest-way-to-import-a-table-into-mysql-and-some-interesting-5-7-performance-results/ and https://jynus.com/dbahire/testing-again-load-data-on-mysql-5-6-5-7-8-0-non-ga-and-mariadb-10-0-10-1-and-10-2-non-ga/

As you can see, my experience is based on actual tests – and not only laboratory tests like the ones above. I run those because they help me make sure my database backups (and recoveries) are generated correctly and can be performed reliably and quickly; I handle both logical dumps and snapshots daily for the half a petabyte of data we store on our MariaDB databases: https://www.slideshare.net/jynus/backing-up-wikipedia-databases

A 50 GB database on a server with 32 GB of memory is a very generous ratio, where around 60% of the data could fit in the buffer pool. In that case, throughput can be optimized greatly, as long as you set up your VM, OS, and MySQL configuration for it (disabling the binary log, increasing the buffer pool and transaction log files, loosening consistency parameters during the import, etc. – see the sketch below). You will also want the original format to be optimized for easy loading, so you don't waste CPU cycles on parsing or converting the format, and you should load in large transactions and on several threads in parallel if possible.
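As an illustrative sketch (not the exact configuration used for the numbers above), these are the kinds of settings you could loosen for the duration of a bulk import on a dedicated 32 GB instance. The concrete values (a ~24 GB buffer pool, relaxed durability) are assumptions for this scenario; note that the transaction log size (innodb_log_file_size) is not dynamic and needs a my.cnf change plus a restart.

    -- Assumed settings for a dedicated, import-only instance; adjust to your hardware.
    SET GLOBAL innodb_buffer_pool_size = 24 * 1024 * 1024 * 1024;  -- ~24 GB of the 32 GB of RAM (dynamic in 5.7+)
    SET GLOBAL innodb_flush_log_at_trx_commit = 2;  -- relax durability while loading
    SET GLOBAL sync_binlog = 0;                     -- only relevant if the binary log stays enabled
    SET SESSION sql_log_bin = 0;                    -- skip binary logging for this import session
    SET SESSION unique_checks = 0;                  -- defer unique checks on secondary indexes
    SET SESSION foreign_key_checks = 0;             -- defer foreign key validation during the load

Revert these (and re-enable durability and checks) once the import has finished and been verified.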

As an example, my production environment has 1-2 TB databases with billions of rows, which I can recover logically in 6-12 hours on a machine with 512 GB of memory, including the many indexes.

Under the right circumstances described above, with a mostly in-memory database, I would be able to load a 50 GB database remotely, in parallel, in around 30 minutes – and in less than 1 hour even if the storage is slow. Be careful, because the tests assume dedicated resources; a CPU, memory, or I/O limitation can create a bottleneck, leading to higher load times.
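For reference, a minimal sketch of what each parallel load could look like, assuming the CSV has already been split into chunks (the file names, table name, and single VARCHAR(255) column are made up for illustration, based only on the "records under 255 characters" description in the question):

    -- Hypothetical table; adapt the columns to your actual record layout.
    CREATE TABLE IF NOT EXISTS genomics (
        id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        record VARCHAR(255) NOT NULL
    ) ENGINE=InnoDB;

    -- Run one of these per client connection/thread, each pointing at a different chunk.
    LOAD DATA LOCAL INFILE '/data/chunks/genomics_00.csv'
    INTO TABLE genomics
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    (record);

Having each connection load its own chunk is one simple way to get the parallelism mentioned above.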