MySQL – Handling a very large MySQL database

innodb, MySQL, performance, performance-tuning

Sorry for the long post!

I have a database containing ~30 tables (InnoDB engine). Only two of these tables, namely "transaction" and "shift", are quite large ("transaction" has 1.5 million rows and "shift" has 23k rows). Right now everything works fine and I have no problems with the current database size.

However, we will have a similar database (same datatypes, design, etc.) but much larger: the "transaction" table will have about 1 billion records (about 2.3 million transactions per day), and we are thinking about how to deal with such a volume of data in MySQL (it is both read- and write-intensive). I have read a lot of related posts to see whether MySQL (and more specifically the InnoDB engine) can perform well with billions of records, but I still have some questions.

What I've understood so far about improving performance for very large tables:

  1. (for InnoDB tables, which is my case) increasing the innodb_buffer_pool_size
     (e.g., up to 80% of RAM). I also found some other MySQL performance tuning
     settings in a Percona blog post.
  2. having proper indexes on the tables (using EXPLAIN on queries; see the
     sketch after this list)
  3. partitioning the table
  4. MySQL sharding or clustering
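
For points 1 and 2, here is a minimal sketch of what the tuning and the index check might look like. The 32 GB RAM figure, the column names, and the query are my assumptions for illustration, not from our actual schema; also note innodb_buffer_pool_size is only resizable at runtime in MySQL 5.7+, otherwise it has to be set in my.cnf before startup:

    -- Hypothetical sizing for a server with 32 GB of RAM dedicated to MySQL.
    -- Dynamic resizing requires MySQL 5.7+; on older versions, set this in
    -- my.cnf and restart the server.
    SET GLOBAL innodb_buffer_pool_size = 25769803776;  -- 24 GB, ~75% of RAM

    -- Hypothetical schema: transaction(id, shift_id, created_at, amount).
    -- EXPLAIN shows whether a query can use an index; "type: ALL" in the
    -- output means a full table scan, which is fatal at a billion rows.
    EXPLAIN
    SELECT id, amount
      FROM transaction
     WHERE shift_id = 42
       AND created_at >= '2015-01-01';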

Here are my questions/confusions:

  • About partitioning, I have some doubts about whether we should use it. On one hand, many people suggest it to improve performance when a table is very large. On the other hand, I've read many posts saying it does not improve query performance and does not make queries run faster (e.g., here and here). Also, I read in the MySQL Reference Manual that InnoDB foreign keys and MySQL partitioning are not compatible (and we have foreign keys).

  • Regarding indexes, they perform well right now, but as far as I understand, indexing is more restrictive for very large tables (as Kevin Bedell mentioned in his answer here). Also, indexes speed up reads but slow down writes (insert/update). So, for the new project that will have this large DB, should we first insert/load all the data and then create the indexes (to speed up the inserts)?

  • If we cannot use partitioning for our big table (the "transaction" table), what are the alternative options for improving performance, apart from MySQL variable settings such as innodb_buffer_pool_size? Should we use MySQL Cluster? (We also have lots of joins.)

Thanks for your time,

Best Answer

Re: partitioning:

This is by far the best way to deal with large data sets. Instead of one index spanning the whole set, each partition carries its own indexes over its own range, so each individual index stays much smaller and at much higher quality.
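
As a sketch, monthly RANGE partitioning of the "transaction" table could look like the following. The column names and ranges are assumptions; note that MySQL requires every unique key on a partitioned table to include the partitioning column, which is why created_at is part of the primary key here, and that the table carries no foreign keys:

    CREATE TABLE transaction (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        shift_id   INT UNSIGNED    NOT NULL,
        created_at DATETIME        NOT NULL,
        amount     DECIMAL(12,2)   NOT NULL,
        PRIMARY KEY (id, created_at),
        KEY idx_shift_id (shift_id)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(created_at)) (
        PARTITION p201501 VALUES LESS THAN (TO_DAYS('2015-02-01')),
        PARTITION p201502 VALUES LESS THAN (TO_DAYS('2015-03-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    );

Queries that filter on created_at can then be pruned to just the relevant partitions instead of touching the whole billion-row table.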

If you can configure your application to maintain referential integrity itself, then you can safely drop the foreign keys. You'll have to make sure that referenced rows in child tables are updated appropriately whenever a parent row changes. The database will no longer prevent you from getting that wrong, and cascading operations won't be available anymore, so you would need to program them into your application. Creating triggers to do it automatically helps.
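
A minimal sketch of one such trigger, assuming the "transaction"/"shift" relationship from the question (the column names are my assumptions; SIGNAL requires MySQL 5.5+, and a matching pair of triggers on "shift" would be needed to emulate ON UPDATE/ON DELETE behavior):

    DELIMITER //
    CREATE TRIGGER transaction_shift_fk
    BEFORE INSERT ON transaction
    FOR EACH ROW
    BEGIN
        -- Emulate the check the dropped FOREIGN KEY constraint used to do.
        IF NOT EXISTS (SELECT 1 FROM shift WHERE id = NEW.shift_id) THEN
            SIGNAL SQLSTATE '45000'
                SET MESSAGE_TEXT = 'transaction.shift_id references a missing shift';
        END IF;
    END//
    DELIMITER ;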

Re: Indexing:

B+Tree indexes start to perform poorly once their depth grows too large. The post you linked contains some good information; e.g., forget about even trying to filter on columns without an index.

For writes, if you have periodic content loads, it would make sense to drop the indexes before a bulk insertion and recreate them afterwards. This is likely to be faster than individual inserts, each of which has to update the table and every index. Partitioning makes this even easier: you can load all the new data into a fresh partition and index it afterwards.
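
A sketch of one such load cycle, assuming a hypothetical secondary index idx_shift_id and CSV drop location (the primary key stays in place so the table remains usable during the load):

    -- Drop the secondary index so the bulk load only maintains the PK.
    ALTER TABLE transaction DROP INDEX idx_shift_id;

    -- Bulk-load the new batch; LOAD DATA is much faster than row-by-row INSERTs.
    LOAD DATA INFILE '/var/lib/mysql-files/transactions.csv'
    INTO TABLE transaction
    FIELDS TERMINATED BY ','
    (shift_id, created_at, amount);

    -- Rebuild the secondary index in one pass over the table.
    ALTER TABLE transaction ADD INDEX idx_shift_id (shift_id);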

Re: Alternative options

Use a better database. ;-) You will really start to feel the limitations of MySQL if your database grows to this scale. Other DBMSes offer a far more capable set of tools for dealing with data of this scope. Which one depends on your budget, use cases, and constraints. MySQL may very well be "good enough", but you should definitely evaluate alternatives before diving in.

Re: Clustering

Clustering is better in some situations and worse in others. For example, it will let you shard your data, but sharding is just horizontal partitioning, so it has the same restrictions on foreign keys. Maintaining a cluster can also create a lot of operational overhead, particularly for write-intensive applications.