MySQL – Is it realistic to expect a database of size over 50 GB (some 300+ million records) to be imported within 3-4 hours on a single server (MySQL)

database-design, MySQL, scalability

I have hundreds of millions of rows in a text/CSV file (a genomics database, by the way – each record is less than 255 characters long…).

Ideally I'd like to make them searchable, since right now my best bet is splitting them (with a little help from Cygwin!) and reading them one by one as ~500 MB text files in Notepad++ (yes… I know…) – a very inconvenient, caveman-like approach.

I'd like to use MySQL, but I'm open to other options. I have a budget of up to $500 for Amazon instances when needed – maybe 32 GB of RAM, a Xeon Gold, and a 200 GB disk on Amazon can do it? It's no problem to use up to 10 instances, each doing concurrent inserts/loading.

I read that someone had achieved 300,000 rows/second using 'load data infile' on a local server with an SSD and 32 GB of RAM – if I can get even 50,000 rows/second and then query it with, say, phpMyAdmin in reasonable time, I'd be happy. Thanks!

Best Answer

I read that someone had achieved 300,000 rows/second using 'load data infile' on a local server with an SSD and 32 GB of RAM

That sounds like it was taken from my blog posts (or at least those were my numbers and specs): https://jynus.com/dbahire/testing-the-fastest-way-to-import-a-table-into-mysql-and-some-interesting-5-7-performance-results/ and https://jynus.com/dbahire/testing-again-load-data-on-mysql-5-6-5-7-8-0-non-ga-and-mariadb-10-0-10-1-and-10-2-non-ga/

As you can see, my experience is based on actual tests – and not only laboratory tests like the ones above. I run those because they help me make sure my database backups (and recoveries) are generated correctly and can be performed reliably and quickly; I handle both logical dumps and snapshots daily for the half a petabyte of data we store on our MariaDB databases: https://www.slideshare.net/jynus/backing-up-wikipedia-databases

A 50 GB database on a server with 32 GB of memory is a very generous ratio, where around 60% of the data could fit in the buffer pool. In that case, throughput can be optimized greatly, as long as you set up your VM, OS, and MySQL configuration for it (disabling the binary log, increasing the buffer pool and transaction log files, loosening consistency parameters during the import, etc. – see the sketch below). You will also want the original format to be optimized for easy loading, so you don't waste CPU cycles on parsing or converting the format, and you should load in large transactions and on several threads in parallel if possible.
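As an illustrative sketch (not the exact configuration used for the numbers above), these are the kinds of settings you could loosen for the duration of a bulk import on a dedicated 32 GB instance. The concrete values (a ~24 GB buffer pool, relaxed durability) are assumptions for this scenario; note that the transaction log size (innodb_log_file_size) is not dynamic and needs a my.cnf change plus a restart.

    -- Assumed settings for a dedicated, import-only instance; adjust to your hardware.
    SET GLOBAL innodb_buffer_pool_size = 24 * 1024 * 1024 * 1024;  -- ~24 GB of the 32 GB of RAM (dynamic in 5.7+)
    SET GLOBAL innodb_flush_log_at_trx_commit = 2;  -- relax durability while loading
    SET GLOBAL sync_binlog = 0;                     -- only relevant if the binary log stays enabled
    SET SESSION sql_log_bin = 0;                    -- skip binary logging for this import session
    SET SESSION unique_checks = 0;                  -- defer unique checks on secondary indexes
    SET SESSION foreign_key_checks = 0;             -- defer foreign key validation during the load

Revert these (and re-enable durability and checks) once the import has finished and been verified.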

As an example, my production environment has 1-2 TB databases with billions of rows, which I can recover logically in 6-12 hours on a machine with 512 GB of memory, including the many indexes.

Under the right circumstances described above, with a mostly in-memory database, I would be able to load a 50 GB database remotely, in parallel, in around 30 minutes – and in less than 1 hour even if the storage is slow. Be careful, because the tests assume dedicated resources; a CPU, memory, or I/O limitation can create a bottleneck, leading to higher load times.
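For reference, a minimal sketch of what each parallel load could look like, assuming the CSV has already been split into chunks (the file names, table name, and single VARCHAR(255) column are made up for illustration, based only on the "records under 255 characters" description in the question):

    -- Hypothetical table; adapt the columns to your actual record layout.
    CREATE TABLE IF NOT EXISTS genomics (
        id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        record VARCHAR(255) NOT NULL
    ) ENGINE=InnoDB;

    -- Run one of these per client connection/thread, each pointing at a different chunk.
    LOAD DATA LOCAL INFILE '/data/chunks/genomics_00.csv'
    INTO TABLE genomics
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    (record);

Having each connection load its own chunk is one simple way to get the parallelism mentioned above.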