I'm using MariaDB 10.1 and the default InnoDB storage storage, and I have a few tables with currently 10 to 100 million rows. These tables will keep growing a few million per month, and it's mostly caching.
They either have a single-column primary key (BIGINT
) or a composite primary key (two BIGINT
columns) and no AUTO_INCREMENT, and I always insert, select or update by primary key. I also do a lot of joins by primary key, or selects WHERE PK IN (1, 2, 3, 4...)
.
Also, these tables receive lots of updates per hour, and I usually update them in batches of 5000 or 10000 at a time. We have more inserts and updates than selects for some of these tables.
I have 3 questions:
-
It seems to me that simple selects by PK returning 1 row (
SELECT x, y FROM table WHERE pk = 123
) will have no actual difference in performance with partitioning. Is that right? -
What about joins or selects as
WHERE PK IN(SELECT PK FROM ...)
? Will it cause more scans to join a partitioned table than a single table? -
Considering I usually do a lot of concurrent batches (multiple servers may send data at the same time) using:
-
INSERT INTO X VALUES (1, 'A'), (2, 'B'), ... ON DUPLICATE KEY UPDATE ...
or -
REPLACE INTO X VALUES(1, 'A'), (2, 'B'),...
-
Will partitioning help with concurrent inserts and updates, say by being able to affect multiple partitions at the same time?
Thanks in advance.
Best Answer
Your questions
Q1: If PK is the Partition key, a "point query" (
WHERE PK = constant
) on the Partition key will do "pruning", then (hopefully) use an index in the single partition it found. No improvement over having a simple table with a suitable index.Q1: If PK is the
PRIMARY KEY
, but not the Partition key, then the query must open and look in every partition to find the row, hence slower.Q2: Generally
IN ( SELECT ... )
performs poorly, very poorly in some cases, and should be avoided. Use aJOIN
instead. For partitioning, "pruning" will probably not be used. Hence slow.Q3:
INSERT
may be about the same speed, regardless. A plain table may be faster due to less complexity.Q3:
REPLACE
isDELETE
+INSERT
.INSERT ... ON DUPLICATE KEY UPDATE ...
is usually a better construct; see if it applies. But, again, no performance improvement.IN
WHERE PK IN (1, 2, 3, 4...)
-- I think I heard that if the list is 'short', it will use partition pruning. Or skip the pruning if the list is long. In either case, a simple table with an index on PK will be at least as fast.Pruning
Think of pruning this way: First it must find the partition (which acts very much like a "sub table"), then it must use an index (if available) to drill down to the desired row(s). Without partitioning, it skips the pruning step; but the BTree index is slightly deeper. So it is a trade-off.
Note: This often means that the optimal set of indexes for a partitioned table and the equivalent non-partitioned table are different.
Case 3
The caveat "A table's index is too big to be cached, but the index for one partition is cacheable" does not seem to apply to your situation. Case 3 came from a time-series wherein most of the activity was in the latest partition (and
PARTITION BY RANGE(TO_DAYS(...))
) and that partition fit in the buffer_pool but the entire table did not.(That is, I agree with Natan, not Jarwad, about the relevance of Case 3.)
BY HASH
BY HASH
-- not useful. (From the link: "PARTITION BY RANGE is the only useful method.") (Or at least, I have yet to see a use case of Hash where performance benefits.)Ingestion
"growing a few million per month" That's not very fast. "growing a few million per day" would begin to be challenging, at which point I would point you at my high speed ingestion blog . Even so, you might take one tip from there -- Load the changes into a transient table, then do the insert/replace/iodku/update/etc from it to the 'real' table.