MySQL – Handling a very large MySQL database

innodb, MySQL, performance, performance-tuning

Sorry for the long post!

I have a database containing ~30 tables (InnoDB engine). Only two of these tables, namely "transaction" and "shift", are quite large ("transaction" has 1.5 million rows and "shift" has 23k rows). Right now everything works fine and I have no problems with the current database size.

However, we will have a similar database (same datatypes, design, etc.) but much larger: the "transaction" table will have about 1 billion records (about 2.3 million transactions per day), and we are thinking about how to deal with such a volume of data in MySQL (it is both read- and write-intensive). I have read a lot of related posts to see whether MySQL (and more specifically the InnoDB engine) can perform well with billions of records, but I still have some questions.

What I've understood so far about improving performance for very large tables:

  1. (for InnoDB tables, which is my case) increasing the innodb_buffer_pool_size
     (e.g., up to 80% of RAM). I also found some other MySQL performance tuning
     settings in a Percona blog post.
  2. having proper indexes on the tables (using EXPLAIN on queries; see the
     sketch after this list)
  3. partitioning the table
  4. MySQL sharding or clustering
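
For points 1 and 2, here is a minimal sketch of what the tuning and the index check might look like. The 32 GB RAM figure, the column names, and the query are my assumptions for illustration, not from our actual schema; also note innodb_buffer_pool_size is only resizable at runtime in MySQL 5.7+, otherwise it has to be set in my.cnf before startup:

    -- Hypothetical sizing for a server with 32 GB of RAM dedicated to MySQL.
    -- Dynamic resizing requires MySQL 5.7+; on older versions, set this in
    -- my.cnf and restart the server.
    SET GLOBAL innodb_buffer_pool_size = 25769803776;  -- 24 GB, ~75% of RAM

    -- Hypothetical schema: transaction(id, shift_id, created_at, amount).
    -- EXPLAIN shows whether a query can use an index; "type: ALL" in the
    -- output means a full table scan, which is fatal at a billion rows.
    EXPLAIN
    SELECT id, amount
      FROM transaction
     WHERE shift_id = 42
       AND created_at >= '2015-01-01';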

Here are my questions/confusions:

  • About partitioning, I have some doubts about whether we should use it. On one hand, many people suggest it to improve performance when a table is very large. On the other hand, I've read many posts saying it does not improve query performance and does not make queries run faster (e.g., here and here). Also, I read in the MySQL Reference Manual that InnoDB foreign keys and MySQL partitioning are not compatible (and we have foreign keys).

  • Regarding indexes, they perform well right now, but as far as I understand, indexing is more restrictive for very large tables (as Kevin Bedell mentioned in his answer here). Also, indexes speed up reads but slow down writes (insert/update). So, for the new project that will have this large DB, should we first insert/load all the data and then create the indexes (to speed up the inserts)?

  • If we cannot use partitioning for our big table (the "transaction" table), what are the alternative options for improving performance, apart from MySQL variable settings such as innodb_buffer_pool_size? Should we use MySQL Cluster? (We also have lots of joins.)

Thanks for your time,

Best Answer

Re: partitioning:

This is by far the best way to deal with large data sets. Instead of one index spanning the whole set, each partition carries its own indexes over its own range, so each individual index stays much smaller and at much higher quality.
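
As a sketch, monthly RANGE partitioning of the "transaction" table could look like the following. The column names and ranges are assumptions; note that MySQL requires every unique key on a partitioned table to include the partitioning column, which is why created_at is part of the primary key here, and that the table carries no foreign keys:

    CREATE TABLE transaction (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        shift_id   INT UNSIGNED    NOT NULL,
        created_at DATETIME        NOT NULL,
        amount     DECIMAL(12,2)   NOT NULL,
        PRIMARY KEY (id, created_at),
        KEY idx_shift_id (shift_id)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(created_at)) (
        PARTITION p201501 VALUES LESS THAN (TO_DAYS('2015-02-01')),
        PARTITION p201502 VALUES LESS THAN (TO_DAYS('2015-03-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    );

Queries that filter on created_at can then be pruned to just the relevant partitions instead of touching the whole billion-row table.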

If you can configure your application to maintain referential integrity itself, then you can safely drop the foreign keys. You'll have to make sure that referenced rows in child tables are updated appropriately whenever a parent row changes. The database will no longer prevent you from getting that wrong, and cascading operations won't be available anymore, so you would need to program them into your application. Creating triggers to do it automatically helps.
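
A minimal sketch of one such trigger, assuming the "transaction"/"shift" relationship from the question (the column names are my assumptions; SIGNAL requires MySQL 5.5+, and a matching pair of triggers on "shift" would be needed to emulate ON UPDATE/ON DELETE behavior):

    DELIMITER //
    CREATE TRIGGER transaction_shift_fk
    BEFORE INSERT ON transaction
    FOR EACH ROW
    BEGIN
        -- Emulate the check the dropped FOREIGN KEY constraint used to do.
        IF NOT EXISTS (SELECT 1 FROM shift WHERE id = NEW.shift_id) THEN
            SIGNAL SQLSTATE '45000'
                SET MESSAGE_TEXT = 'transaction.shift_id references a missing shift';
        END IF;
    END//
    DELIMITER ;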

Re: Indexing:

B+Tree indexes start to perform poorly once their depth grows too large. The post you linked contains some good information; e.g., forget about even trying to filter on columns without an index.

For writes, if you have periodic content loads, it would make sense to drop the indexes before a bulk insertion and recreate them afterwards. This is likely to be faster than individual inserts, each of which has to update the table and every index. Partitioning makes this even easier: you can load all the new data into a fresh partition and index it afterwards.
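
A sketch of one such load cycle, assuming a hypothetical secondary index idx_shift_id and CSV drop location (the primary key stays in place so the table remains usable during the load):

    -- Drop the secondary index so the bulk load only maintains the PK.
    ALTER TABLE transaction DROP INDEX idx_shift_id;

    -- Bulk-load the new batch; LOAD DATA is much faster than row-by-row INSERTs.
    LOAD DATA INFILE '/var/lib/mysql-files/transactions.csv'
    INTO TABLE transaction
    FIELDS TERMINATED BY ','
    (shift_id, created_at, amount);

    -- Rebuild the secondary index in one pass over the table.
    ALTER TABLE transaction ADD INDEX idx_shift_id (shift_id);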

Re: Alternative options

Use a better database. ;-) You will really start to feel the limitations of MySQL if your database grows to this scale. Other DBMSes offer a far more capable set of tools for dealing with data of this scope. Which one depends on your budget, use cases, and constraints. MySQL may very well be "good enough", but you should definitely evaluate alternatives before diving in.

Re: Clustering

Clustering is better in some situations and worse in others. For example, it will let you shard your data, but sharding is just horizontal partitioning, so it has the same restrictions on foreign keys. Maintaining a cluster can also create a lot of operational overhead, particularly for write-intensive applications.