Mysql – Does partitioning help with lookups and inserts/updates by primary key

database-internalsmariadbMySQLpartitioning

I'm using MariaDB 10.1 and the default InnoDB storage storage, and I have a few tables with currently 10 to 100 million rows. These tables will keep growing a few million per month, and it's mostly caching.

They either have a single-column primary key (BIGINT) or a composite primary key (two BIGINT columns) and no AUTO_INCREMENT, and I always insert, select or update by primary key. I also do a lot of joins by primary key, or selects WHERE PK IN (1, 2, 3, 4...).

Also, these tables receive lots of updates per hour, and I usually update them in batches of 5000 or 10000 at a time. We have more inserts and updates than selects for some of these tables.

I have 3 questions:

  1. It seems to me that simple selects by PK returning 1 row (SELECT x, y FROM table WHERE pk = 123) will have no actual difference in performance with partitioning. Is that right?

  2. What about joins or selects as WHERE PK IN(SELECT PK FROM ...)? Will it cause more scans to join a partitioned table than a single table?

  3. Considering I usually do a lot of concurrent batches (multiple servers may send data at the same time) using:

    • INSERT INTO X VALUES (1, 'A'), (2, 'B'), ... ON DUPLICATE KEY UPDATE ... or

    • REPLACE INTO X VALUES(1, 'A'), (2, 'B'),...

Will partitioning help with concurrent inserts and updates, say by being able to affect multiple partitions at the same time?

Thanks in advance.

Best Answer

Your questions

Q1: If PK is the Partition key, a "point query" (WHERE PK = constant) on the Partition key will do "pruning", then (hopefully) use an index in the single partition it found. No improvement over having a simple table with a suitable index.
Q1: If PK is the PRIMARY KEY, but not the Partition key, then the query must open and look in every partition to find the row, hence slower.

Q2: Generally IN ( SELECT ... ) performs poorly, very poorly in some cases, and should be avoided. Use a JOIN instead. For partitioning, "pruning" will probably not be used. Hence slow.

Q3: INSERT may be about the same speed, regardless. A plain table may be faster due to less complexity.
Q3: REPLACE is DELETE + INSERT. INSERT ... ON DUPLICATE KEY UPDATE ... is usually a better construct; see if it applies. But, again, no performance improvement.

IN

WHERE PK IN (1, 2, 3, 4...) -- I think I heard that if the list is 'short', it will use partition pruning. Or skip the pruning if the list is long. In either case, a simple table with an index on PK will be at least as fast.

Pruning

Think of pruning this way: First it must find the partition (which acts very much like a "sub table"), then it must use an index (if available) to drill down to the desired row(s). Without partitioning, it skips the pruning step; but the BTree index is slightly deeper. So it is a trade-off.

Note: This often means that the optimal set of indexes for a partitioned table and the equivalent non-partitioned table are different.

Case 3

The caveat "A table's index is too big to be cached, but the index for one partition is cacheable" does not seem to apply to your situation. Case 3 came from a time-series wherein most of the activity was in the latest partition (and PARTITION BY RANGE(TO_DAYS(...))) and that partition fit in the buffer_pool but the entire table did not.

(That is, I agree with Natan, not Jarwad, about the relevance of Case 3.)

BY HASH

BY HASH -- not useful. (From the link: "PARTITION BY RANGE is the only useful method.") (Or at least, I have yet to see a use case of Hash where performance benefits.)

Ingestion

"growing a few million per month" That's not very fast. "growing a few million per day" would begin to be challenging, at which point I would point you at my high speed ingestion blog . Even so, you might take one tip from there -- Load the changes into a transient table, then do the insert/replace/iodku/update/etc from it to the 'real' table.