Mysql – When range partitioning, should I drop primary UUID-like key

MySQLmysql-5.6partitioning

I'm aware of the limitation "every unique key on the table must use every column in the table's partitioning expression". So given that my primary key is a UUID, it's going to require a change.

Background: I have a transaction history table that grows at about 1.5G per month (more so than I can keep upgrading server memory). The table has a PRIMARY key (uuid_short) and a datetime column, among others. There are two types of queries hitting the table: those that return a list of transactions in a date-range, and those that refer to a specific transaction through its primary key.

Question: Since vast majority of queries don't care for transactions more than 18 months old I thought datetime could be a candidate for range partitioning. Does this mean I'm better off dropping the PRIMARY key and converting it to a non-unique key? (dups are not a problem, the app servers guarantee there cannot be any).

The system is currently a hybrid of OLTP and OLAP scenarios which is probably why it doesn't scale.

Is partitioning going to make a positive impact on memory footprint?

If not, then I may be better off reorganizing the system by moving historical transactions to a data warehouse like AWS Redshift. It would be quite a refactoring effort, but if that's the preferred solution then I'd like to plan early.

This is MySQL 5.6.

Best Answer

Assuming

Some queries involve a date range, usually "recent" dates,
Other queries are point-queries, which involves a unique key
You can purge data older than 18 months.

Anything else?

UUIDs are terrible for performance when the dataset size becomes too big to cache.

It would take some work, but this is what I recommend:

id INT UNSIGNED NOT NULL AUTO_INCREMENT -- or `BIGINT` if might go past 4 billion
dt DATETIME NOT NULL  -- or TIMESTAMP if you prefer
PRIMARY KEY(dt, id)
INDEX(id)

Characteristics:

The data is in datetime order, thereby making those queries very efficient,
Tacking id onto the PK assures that the it is unique,
id replaces your uuid_short in the application (so, code change),
There is a PK, which InnoDB really likes to have,
You won't need to keep adding RAM -- most of the activity is near the "end" of the table,
INSERTs are very efficient -- both the data (& PK) and the secondary key (id) are being added to at the end.

Now, I will add on the purging.

PARTITION BY RANGE(TO_DAYS(dt)) is the optimal way to do it. Define the ranges as months. Each month: DROP (instantly) the oldest month and ADD a new month. See my blog for more details.

How to transition from uuid_short to id?

Add the AUTO_INCREMENT id and INDEX(id) to the existing table in a single ALTER,
Convert the app's use of uuid_short to id,
Drop the uuid_short column and change the indexes (all in a single ALTER),
Add PARTITIONing.

Steps 1,3,4 will each require copying over the entire table; the sooner you do it, the less downtime you will suffer. Or consider pt-online-schema-change or gh-ost for doing them "online".

Another tip: See if you can shrink any of the datatypes of other columns.

Related Solutions

MySQL Partitioning

Personally I'd use the date as your partition function, and partition by a hash of the year and month. Maybe splitting the data into 48 or more partitions. I've done this on some large volume databases and had good results.

ALTER TABLE `your_table` 
PARTITION BY HASH(YEAR(`date_field`)*12 + MONTH(`date_field`)) 
  PARTITIONS 48;

This should create a nice distributed set of data across 48 partitions (you may need to fiddle with the calculation on the date to get it quite right for your needs).

I build a model in Excel, with all the dates down one column, put the partition function on the second showing which partition that data would appear in. You can then chart the second column frequency to see how the data distribution is placed across the partitions - a really useful way of tinkering with your function before you alter your table!

Hope that helps...

Sql-server – SQL Server Partitioning – what to use for partition key

The server spec questions should be directed to either Serverfault or DBA.SE.

For the partitioning question, I don't think you necessarily need to partition for this.

360m rows is a lot but it's not too unwieldy.

Do NOT under any circumstances try to partition based on the last digit of a field. I'm not sure this would even work, but it's not SARGable which wouldn't be tenable.

If you only need to do a single row seek based on a numeric key, partitioning probably will not help.

If you do decide to pursue the partition route, bear in mind to be effective all your queries need to include your partition key(s) so the engine knows which partition to check. Otherwise it will check them all and you actually hurt performance.

Best Answer

Related Solutions

MySQL Partitioning

Sql-server – SQL Server Partitioning – what to use for partition key

Related Question