MySQL – Does partitioning help with lookups and inserts/updates by primary key?

database-internals, mariadb, MySQL, partitioning

I'm using MariaDB 10.1 with the default InnoDB storage engine, and I have a few tables that currently hold 10 to 100 million rows each. These tables will keep growing by a few million rows per month, and they are used mostly for caching.

They either have a single-column primary key (BIGINT) or a composite primary key (two BIGINT columns) and no AUTO_INCREMENT, and I always insert, select or update by primary key. I also do a lot of joins by primary key, or selects WHERE PK IN (1, 2, 3, 4...).

Also, these tables receive lots of updates per hour, and I usually update them in batches of 5000 or 10000 at a time. We have more inserts and updates than selects for some of these tables.

I have 3 questions:

1. It seems to me that simple selects by PK returning 1 row (SELECT x, y FROM table WHERE pk = 123) will have no actual difference in performance with partitioning. Is that right?
2. What about joins or selects such as WHERE PK IN (SELECT PK FROM ...)? Will it cause more scans to join a partitioned table than a single table?
3. Considering I usually do a lot of concurrent batches (multiple servers may send data at the same time) using:
   - INSERT INTO X VALUES (1, 'A'), (2, 'B'), ... ON DUPLICATE KEY UPDATE ... or
   - REPLACE INTO X VALUES (1, 'A'), (2, 'B'), ...

   Will partitioning help with concurrent inserts and updates, say by being able to affect multiple partitions at the same time?

Thanks in advance.

Best Answer

Your questions

Q1: If PK is the Partition key, a "point query" (WHERE PK = constant) on the Partition key will do "pruning", then (hopefully) use an index in the single partition it found. No improvement over having a simple table with a suitable index.
Q1: If PK is the PRIMARY KEY, but not the Partition key, then the query must open and look in every partition to find the row, hence slower.
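
Both Q1 cases can be checked on your own server by looking at the partitions column of EXPLAIN PARTITIONS. The sketch below uses made-up tables (cache_single, cache_pair) and an arbitrary 8 hash partitions; none of these names come from the question. MySQL and MariaDB require every column used in the partitioning expression to be part of the PRIMARY KEY, so the second case shows up in practice when a composite PK is searched by a column other than the one the table is partitioned on.

```sql
-- Hypothetical demo tables; names and partition counts are assumptions.
CREATE TABLE cache_single (
  pk  BIGINT NOT NULL,
  val VARCHAR(100),
  PRIMARY KEY (pk)
) ENGINE=InnoDB
PARTITION BY HASH (pk) PARTITIONS 8;

CREATE TABLE cache_pair (
  a   BIGINT NOT NULL,
  b   BIGINT NOT NULL,
  val VARCHAR(100),
  PRIMARY KEY (a, b)
) ENGINE=InnoDB
PARTITION BY HASH (b) PARTITIONS 8;

-- Point query on the partition key: the partitions column
-- should list a single partition (pruning applied).
EXPLAIN PARTITIONS
SELECT val FROM cache_single WHERE pk = 123;

-- Filter only on a PK column that is not the partition key:
-- the partitions column should list all 8 partitions (no pruning).
EXPLAIN PARTITIONS
SELECT val FROM cache_pair WHERE a = 123;
```

EXPLAIN PARTITIONS is accepted by MariaDB 10.1. Newer MySQL versions show the partitions column in plain EXPLAIN, so drop the PARTITIONS keyword there.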

Q2: Generally IN ( SELECT ... ) performs poorly, very poorly in some cases, and should be avoided. Use a JOIN instead. For partitioning, "pruning" will probably not be used. Hence slow.
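
As a rough illustration of the rewrite suggested for Q2, here is the same lookup written both ways. The tables big_cache and wanted are hypothetical stand-ins, not names from the question.

```sql
-- Hypothetical tables: big_cache is the large table,
-- wanted holds the keys to fetch.
CREATE TABLE big_cache (
  pk  BIGINT NOT NULL PRIMARY KEY,
  val VARCHAR(100)
) ENGINE=InnoDB;

CREATE TABLE wanted (
  pk BIGINT NOT NULL PRIMARY KEY
) ENGINE=InnoDB;

-- Subquery form, as in the question:
SELECT c.pk, c.val
FROM big_cache AS c
WHERE c.pk IN (SELECT w.pk FROM wanted AS w);

-- JOIN form; it returns the same rows here because wanted.pk is unique:
SELECT c.pk, c.val
FROM big_cache AS c
JOIN wanted AS w ON w.pk = c.pk;
```

If the key list can contain duplicates, the JOIN returns one row per match, so deduplicate the key list first or add DISTINCT.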

Q3: INSERT may be about the same speed, regardless. A plain table may be faster due to less complexity.
Q3: REPLACE is DELETE + INSERT. INSERT ... ON DUPLICATE KEY UPDATE ... is usually a better construct; see if it applies. But, again, no performance improvement.
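
A small sketch of how the two constructs differ, using a hypothetical table x_demo in place of X. The column names, including the extra hits column, are assumptions added only to make the difference visible.

```sql
-- Hypothetical table standing in for X from the question.
CREATE TABLE x_demo (
  id   BIGINT NOT NULL PRIMARY KEY,
  val  VARCHAR(100),
  hits INT NOT NULL DEFAULT 0
) ENGINE=InnoDB;

-- REPLACE: an existing row with the same id is deleted, then a new row
-- is inserted. Columns not listed (hits) are reset to their defaults.
REPLACE INTO x_demo (id, val) VALUES (1, 'A'), (2, 'B');

-- IODKU: an existing row is updated in place; columns not mentioned
-- in the UPDATE clause (hits) keep their current values.
INSERT INTO x_demo (id, val) VALUES (1, 'A'), (2, 'B')
ON DUPLICATE KEY UPDATE val = VALUES(val);
```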

WHERE PK IN (1, 2, 3, 4...) -- I think I heard that if the list is 'short', it will use partition pruning. Or skip the pruning if the list is long. In either case, a simple table with an index on PK will be at least as fast.

Pruning

Think of pruning this way: First it must find the partition (which acts very much like a "sub table"), then it must use an index (if available) to drill down to the desired row(s). Without partitioning, it skips the pruning step; but the BTree index is slightly deeper. So it is a trade-off.

Note: This often means that the optimal set of indexes for a partitioned table and the equivalent non-partitioned table are different.

Case 3

The caveat "A table's index is too big to be cached, but the index for one partition is cacheable" does not seem to apply to your situation. Case 3 came from a time-series wherein most of the activity was in the latest partition (and PARTITION BY RANGE(TO_DAYS(...))) and that partition fit in the buffer_pool but the entire table did not.
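
For reference, the kind of time-series layout that Case 3 describes might look roughly like the following. The table, columns, partition names and dates are all illustrative assumptions, not taken from the original case.

```sql
-- Hypothetical time-series table; all names and dates are illustrative.
CREATE TABLE readings (
  sensor_id BIGINT NOT NULL,
  ts        DATETIME NOT NULL,
  reading   DOUBLE,
  PRIMARY KEY (sensor_id, ts)   -- the partitioning column must be in the PK
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(ts)) (
  PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
  PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
  PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- A query bounded to a recent date range should list only a subset
-- of the partitions in the partitions column, not all of them.
EXPLAIN PARTITIONS
SELECT sensor_id, reading
FROM readings
WHERE ts >= '2024-02-10' AND ts < '2024-02-20';
```

With this layout, most activity lands in the newest partition, which is what lets that one partition stay in the buffer_pool even when the whole table does not fit.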

(That is, I agree with Natan, not Jarwad, about the relevance of Case 3.)

BY HASH

BY HASH -- not useful. (From the link: "PARTITION BY RANGE is the only useful method.") (Or at least, I have yet to see a use case of Hash where performance benefits.)

Ingestion

"growing a few million per month" -- That's not very fast. "growing a few million per day" would begin to be challenging, at which point I would point you at my high speed ingestion blog. Even so, you might take one tip from there -- load the changes into a transient table, then do the INSERT/REPLACE/IODKU/UPDATE/etc. from it into the 'real' table.

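The transient-table tip could be sketched as follows. This is only a minimal outline of the idea as stated above, not the full technique from the blog; the names real_cache and staging_cache are hypothetical.

```sql
-- Hypothetical names: real_cache is the 'real' table,
-- staging_cache is the transient one.
CREATE TABLE real_cache (
  id  BIGINT NOT NULL PRIMARY KEY,
  val VARCHAR(100)
) ENGINE=InnoDB;

CREATE TABLE staging_cache LIKE real_cache;

-- 1. Load one batch (5000 to 10000 rows in practice) into staging.
INSERT INTO staging_cache (id, val)
VALUES (1, 'A'), (2, 'B'), (3, 'C');

-- 2. Apply the whole batch to the real table in one statement.
INSERT INTO real_cache (id, val)
SELECT id, val FROM staging_cache
ON DUPLICATE KEY UPDATE val = VALUES(val);

-- 3. Empty the staging table for the next batch.
TRUNCATE TABLE staging_cache;
```

With several servers sending batches at the same time, each writer would presumably need its own staging table, or some coordination around steps 2 and 3.
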
Related Solutions

MySQL Partitioning

Personally I'd use the date as your partition function, and partition by a hash of the year and month. Maybe splitting the data into 48 or more partitions. I've done this on some large volume databases and had good results.

ALTER TABLE `your_table` 
PARTITION BY HASH(YEAR(`date_field`)*12 + MONTH(`date_field`)) 
  PARTITIONS 48;

This should create a nice distributed set of data across 48 partitions (you may need to fiddle with the calculation on the date to get it quite right for your needs).

I build a model in Excel, with all the dates down one column, put the partition function on the second showing which partition that data would appear in. You can then chart the second column frequency to see how the data distribution is placed across the partitions - a really useful way of tinkering with your function before you alter your table!
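
The same distribution check can also be done in SQL against the existing data, as an alternative to the spreadsheet. This sketch reuses `your_table` and `date_field` from the ALTER above and relies on the fact that (non-linear) HASH partitioning places a row in partition MOD(expr, number_of_partitions).

```sql
-- Rows per would-be partition, computed before altering the table.
SELECT MOD(YEAR(`date_field`) * 12 + MONTH(`date_field`), 48) AS partition_no,
       COUNT(*) AS row_count
FROM `your_table`
GROUP BY partition_no
ORDER BY partition_no;

-- After the ALTER, per-partition row counts as the server reports them
-- (TABLE_ROWS is only an estimate for InnoDB).
SELECT PARTITION_NAME, TABLE_ROWS
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'your_table'
ORDER BY PARTITION_ORDINAL_POSITION;
```

With this expression, consecutive months land in consecutive partitions and the numbering wraps around after 48 months, so data less than four years deep will leave some partitions empty.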

Hope that helps...

Sql-server – do partitions make sense when dates are not involved

You did not explain why you want to partition the table or what you expect from partitioning. You only mention table size, which is hardly a criterion for partitioning. Performance-wise, partitioning will make everything slower, not faster. The best you can hope for is on-par performance with the unpartitioned table. Some scenarios that do make sense with partitioning are:

- ETL needs, like load jobs that manipulate data intensively in staging tables and then switch in the entire staging table in one fast operation
- the need to remove large volumes of data that are past their retention period (monthly switch out and truncate)
- administrative reasons, like the need to rebuild individual partitions

Many quote reasons like 'move old data to slower disks' but I don't buy that argument much. Another oft-quoted reason is to distribute data across multiple files, but that is incorrect thinking, as a filegroup can contain multiple files across many volumes and the engine would distribute IO across them anyway, without any need for partitioning.

You mention that your developers quote a case where partitioning by date increased performance. Perhaps it is a case of time-series data that was clustered by id, not by date, and in which all range queries (typical for time series) had to do table scans. Partitioning appeared to help because partition elimination reduced the amount of data scanned. But a proper clustered index would have addressed the problem much better (no surprise there; indexing is usually the correct answer to query performance issues).

An extreme case where partitioning is useful is hash partitioning to help spread last-page insert latch contention.

But when push comes to shove, partitioning is an 'all-or-nothing' approach that has a very heavy impact (consider that you can no longer have a unique primary key that does not include the partitioning field) and percolates through the data model everywhere (e.g. a lot of foreign keys have to be redesigned). It requires careful administration. The Query Optimizer can misfire some horrible plans in the presence of partitioning.

Kendra Little has a nice article explaining pros and cons: How To Decide if You Should Use Table Partitioning
