I'll share my own experience with how I ended up defining the terms "large", "big", and "a lot":
- a lot is a database that takes around 400 GB for a full month of data (custom logging information from all our web apps)
- big is a table in this database that accounts for much of that space :-) .. about half of it (roughly 200 GB for a single table, from what I remember)
- large is a single day's worth of data (about 12-15 GB), which corresponds to one partition of that magic table
This is not a rule or a best practice, but when you feel that your current indexing strategy is getting you nowhere and nothing seems to make your queries faster, I believe it's time to consider partitioning.
I don't think it's possible. The notion seems to fly in the face of everything that we know about partitioning and clustered indexes, but I might be wrong.
One possibility is a partitioned view over two tables, one with a clustered index, one as a heap. I have included a demo script for this, but it does come with a number of limitations which might be critical:
- inserts seem to need to list every column explicitly
- the heap table still needs a primary key which includes the partitioning column (even if the primary key is not clustered)
- moving data from 'recent' to 'archive' involves moving rows from one table to the other (a sketch of that move follows the demo script below)
Overall your situation is probably complex enough that it needs a much more considered analysis.
Use tempdb;

-- 'Archive' table: clustered primary key on (partition_column, id)
Create Table TestOld (
    id int,
    partition_column char(1) Constraint chk_partition_old Check (partition_column = 'O'),
    other_column char(100),
    Constraint pTestOld Primary Key Clustered (partition_column, id));

-- 'Recent' table: stored as a heap (the primary key is nonclustered)
Create Table TestNew (
    id int,
    partition_column char(1) Constraint chk_partition_new Check (partition_column = 'N'),
    other_column char(100),
    Constraint pTestNew Primary Key Nonclustered (partition_column, id));
Go

-- Partitioned view over the two tables; the check constraints on
-- partition_column make the view updatable and let the optimizer
-- skip the table whose constraint cannot match the predicate
Create View Test As
Select id, partition_column, other_column
From TestOld
Union All
Select id, partition_column, other_column
From TestNew;
Go

-- Inserts through the view are routed to the matching base table
Insert Into Test (id, partition_column, other_column)
Values (1, 'O', ''),
       (2, 'N', '');

-- Filtering on partition_column means only TestNew needs to be read
Select *
From Test
Where partition_column = 'N';
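As a follow-up to the data-movement limitation mentioned above, here is a minimal sketch of what 'archiving' a row could look like with these demo tables, assuming a plain transactional delete-and-insert is acceptable (the id value is arbitrary):

-- Move row id = 2 from 'recent' (TestNew) to 'archive' (TestOld)
Begin Transaction;

Insert Into TestOld (id, partition_column, other_column)
Select id, 'O', other_column
From TestNew
Where id = 2;

Delete From TestNew
Where id = 2;

Commit Transaction;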
A second option, which may have more merit in your case, is a filtered index over the old part of your table. Include a large number of columns so that it covers the queries you run against the old data; this would speed up reads on the old part of the table without slowing down writes to the recent part.
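For illustration only, such an index might look something like the following; the table name, column names, and cut-off date (dbo.Orders, OrderDate, CustomerId, Status, TotalAmount, '20200101') are placeholders for whatever defines the 'old' part of your table, not anything taken from your schema:

Create Nonclustered Index ix_Orders_Archive
On dbo.Orders (OrderDate)
Include (CustomerId, Status, TotalAmount)  -- covering columns for the archive queries
Where OrderDate < '20200101';              -- filter: only the 'old' rows are indexed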
Best Answer
Let's compare the unpartitioned and partitioned layouts.
PARTITION SIZE
If you have the following:
- a table with 100,000,000 rows
- a BTREE index on the lookup column
- 1024 keys per tree-node page

What would the metrics look like?
Since LOG(100000000)/LOG(2) = 26.575424759099, a BTREE index with 1024 keys per tree-node page would have a tree height of only 3 (CEILING(LOG(100000000)/LOG(1024))). With only three node pages to visit, a binary search within each accessed tree node would examine about 30 keys in total (roughly 10 comparisons per node).
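If you want to reproduce the arithmetic, the expressions can be run as a plain SELECT (LOG() is the natural logarithm in both MySQL and SQL Server, so LOG(n)/LOG(1024) gives the base-1024 logarithm):

-- Unpartitioned case: 100,000,000 rows, 1024 keys per tree-node page
SELECT CEILING(LOG(100000000) / LOG(1024))                        AS tree_height,    -- 3 levels
       CEILING(LOG(100000000) / LOG(1024)) * (LOG(1024) / LOG(2)) AS keys_examined;  -- about 30 key comparisons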
NUMBER OF PARTITIONS
If you have the following:
- the same 100,000,000 rows
- the table split into 1024 partitions
- a BTREE index with 1024 keys per tree-node page in each partition

The numbers would be slightly different. Each partition should have about 97656 rows. What would the metrics become now?
Since LOG(97656)/LOG(2) = 16.575421065795, a BTREE index with 1024 keys per tree-node page would have a tree height of only 2 (CEILING(LOG(97656)/LOG(1024))). With only two node pages to visit, a binary search within each accessed tree node would examine about 20 keys in total (roughly 10 comparisons per node).
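The same sanity check for the partitioned layout, including where the 97656 figure comes from (assuming the 100,000,000 rows split evenly across 1024 partitions):

-- Partitioned case: 100,000,000 rows spread over 1024 partitions
SELECT FLOOR(100000000 / 1024)                                AS rows_per_partition, -- 97656
       CEILING(LOG(97656) / LOG(1024))                        AS tree_height,        -- 2 levels
       CEILING(LOG(97656) / LOG(1024)) * (LOG(1024) / LOG(2)) AS keys_examined;      -- about 20 key comparisons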
CONCLUSION
Spreading out the keys removes just one tree level but essentially creates 1024 indexes. The queries won't know the difference, and any search-time gain in favor of partitions would probably be nominal at best. However, make sure all the data is actively queried. Otherwise, you may be hitting just a few partitions, while the other partitions with rarely-accessed data just take up space and are never accessed frequently enough to justify the partitioning. You may also have more blatant performance factors to worry about (such as internal defragmentation in XFS, ext3 vs ext4, etc.). You also need to consider which storage engine you are using, because: