MySQL – Efficient Partition Pruning with ORDER BY and LIMIT

index-tuning, mariadb, MySQL, order-by, partitioning

I've set up a table in MariaDB (10.4.5, currently RC) with InnoDB, partitioned by a column whose value only ever increases, so new data is always inserted at the end. For this case, partitioning makes sense to speed up some queries and to keep new/active partitions on fast drives and older/archived ones on slow spinning disks. For insert speed-ups it's working great! (Sort of the TimescaleDB approach, but without time and without PostgreSQL.)

SELECTs by range on that same column work fine too; they only read (the index of) the partitions covering the specified range. All cool so far.

Now, I also have queries that have no clause on that column but are ordered descending by it (i.e. freshest data first), together with a LIMIT; these usually would hit only the latest one or two partitions (fast, cached index). However, it seems that MySQL/MariaDB opens the partitions from first to last, no matter what ordering is specified. Is it really that dumb? Moreover, I couldn't find anyone else asking this question, which worries me a bit. (Sometimes that means I'm missing something really obvious.)

To get more concrete here – for testing I have the following table:

CREATE TABLE `mytable` (
  `user_id` bigint(20) unsigned NOT NULL,
  `my_id` bigint(20) unsigned NOT NULL,
  `data` varbinary(123) DEFAULT NULL,
  PRIMARY KEY (`user_id`,`my_id`),
  UNIQUE KEY `my_id_idx` (`my_id`)  -- I was hoping this one could help me
) ENGINE=InnoDB ROW_FORMAT=COMPACT
 PARTITION BY RANGE (`my_id`)
(PARTITION `p0` VALUES LESS THAN (10000000) ENGINE = InnoDB,
 PARTITION `p10M` VALUES LESS THAN (20000000) ENGINE = InnoDB,
 PARTITION `p20M` VALUES LESS THAN (30000000) ENGINE = InnoDB,
 PARTITION `p30M` VALUES LESS THAN (40000000) ENGINE = InnoDB,
 [...]
) 

And I run a query like:

SELECT 
    user_id,
    my_id,
    LENGTH(data) AS data_len
FROM
    mytable
    -- tried to optimize with index hints:
    -- USE INDEX FOR ORDER BY (MY_ID_IDX)
    -- USE INDEX FOR ORDER BY (PRIMARY)
    -- USE INDEX FOR ORDER BY (MY_ID_IDX, PRIMARY)
WHERE
    user_id = 1234567
ORDER BY my_id DESC
LIMIT 10;

I found out that it first looks for ALL the data for user_id = 1234567 (shown by heavy I/O load on the spinning disks), then finally reaches the fast storage to complete the full set, and only then cuts it down to the last LIMIT 10 rows… which were all on fast storage, so we wasted minutes for nothing! Hmm.

My data is too big for all indexes to fit into memory; we rely on 'enough' of the on-disk index being cached at the storage layer. But even if all indexes did fit into cache, the data still has to come from the disks, and some users have a HUGE amount of data here (>10M rows), so it's simply inefficient to do that sorting in memory. So I'm hoping to find a way to have MariaDB look only for the last LIMIT rows and then stop reading.

As a human, you would start looking in the last partition first, because the query is ORDER BY my_id DESC and the latest partitions contain the highest values for it. But how do I tell MySQL/MariaDB to do that?

EXPLAIN PARTITIONS result (it's the same for all the USE INDEX variants listed above):
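(That is, the test query above prefixed with EXPLAIN PARTITIONS, roughly like this:)

EXPLAIN PARTITIONS
SELECT user_id, my_id, LENGTH(data) AS data_len
FROM mytable
WHERE user_id = 1234567
ORDER BY my_id DESC
LIMIT 10;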

  select_type: SIMPLE
        table: mytable
   partitions: p0M,p10M,p20M,p30M, ... (~ hundred here)
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: const
         rows: 9999999 (worst-case)
        Extra: Using where

In fact, contrary to what I expected, it doesn't even perform better if I run the query in ascending order, going from the first to the newest partition. It still reads the indexes of all partitions and only then finds out it needed just one…

I've heard something about a global index for partitions in future versions of MySQL, but I doubt it would really help here given the huge size… and in my case the partitioning layout itself already provides the hint. The information I can find about 'partition pruning' seems unrelated to the ordering of reads; it's only about clauses in the query.
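Presumably that means I could force pruning by putting an explicit bound on the partitioning column myself. A sketch (the 30000000 bound is made up; I'd have to derive it somehow and retry with a wider range whenever fewer than 10 rows come back):

SELECT user_id, my_id, LENGTH(data) AS data_len
FROM mytable
WHERE user_id = 1234567
  AND my_id > 30000000   -- explicit range on the partitioning column,
                         -- so only the newest partition(s) get read
ORDER BY my_id DESC
LIMIT 10;

But that pushes the work of picking a sensible bound onto every client.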

Any help is appreciated. :-)

Newer partitions will be created dynamically, so it's not really feasible to hint at a specific partition. My situation is that the "newest" partitions are fast, "older" ones are slow, and the "oldest" ones are super-slow, assuming nothing is cached at the storage layer because there is simply too much data. Additionally, I'm using a proxy (SPIDER) on a separate machine that is supposed to give the clients a single interface to query, without them having to know about the backends' partitioning layout, so I'd prefer a way to make it 'automatic'.
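(For clarity, by a hint on a specific partition I mean explicit partition selection like the sketch below; it should restrict the read to the named partition, but the client would have to know which partition is currently the newest, which is exactly what the SPIDER setup is meant to hide:)

SELECT user_id, my_id, LENGTH(data) AS data_len
FROM mytable PARTITION (p30M)   -- pretend p30M is currently the newest partition
WHERE user_id = 1234567
ORDER BY my_id DESC
LIMIT 10;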

Best Answer

Congratulations. I think you have found a case where partitioning can't be made even as fast as non-partitioning.

WHERE user_id = 1234567
ORDER BY my_id DESC
LIMIT 10;

Needs INDEX(user_id, my_id), in that order and without partitioning. Then it would touch only 10 rows and quit.

With the partitioning you have, it must check each partition, gather the row(s) found in each partition, sort them, then stop at the 10th.
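Conceptually (not literally what the server executes; just to illustrate the amount of work), the partitioned plan behaves roughly like a UNION ALL over every partition, with the sort and cut-off applied only at the very end:

SELECT * FROM (
      SELECT user_id, my_id, LENGTH(data) AS data_len
        FROM mytable PARTITION (p0)   WHERE user_id = 1234567
  UNION ALL
      SELECT user_id, my_id, LENGTH(data) AS data_len
        FROM mytable PARTITION (p10M) WHERE user_id = 1234567
  -- ... one branch like this per remaining partition ...
) AS all_parts
ORDER BY my_id DESC
LIMIT 10;

Only after every branch has been visited can the final ORDER BY ... LIMIT 10 be applied.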

"Partitioning is not a performance panacea".

Do you have other queries that benefit from that PARTITION BY RANGE? If so, you may be in a trade-off situation: some queries run faster, some run slower.

In general, if there are a reasonably limited number of "users", and you are inserting new rows for each user continually, it is fine to have one "hot spot" per user.

That leads to

PRIMARY KEY(user_id, my_id)

with my_id unique in some fashion. It does not have to be declared UNIQUE. If it is AUTO_INCREMENT, then this works fine:

my_id INT AUTO_INCREMENT,
PRIMARY KEY(user_id, my_id),  -- to cluster by user
INDEX(my_id)                  -- to keep AUTO_INCREMENT happy

With such, most queries like this work quite efficiently:

WHERE user_id = 12345
  AND ((other stuff))

The caching in the buffer_pool is more important than SSD vs HDD. And the number of blocks touched is important to performance.

The INSERTs need one block per user. Eventually, there will be a block split. But then, it is back to one active block (a "hot spot").

SELECTs, even if the desired blocks are not in the buffer_pool, tend to be efficient because WHERE user_id=... puts the desired rows into very few blocks. That is especially true for the SELECT ... LIMIT 10 that you mentioned.

Blocks are cached. Whole INDEXes are not. The query in question will look at only 1 (maybe 2) block in the non-partitioned layout. The rest of the index will come and go based on activity.

10M rows is 'large'; 1 billion rows is 'huge'. Global indexes are probably years off for both MySQL and MariaDB; don't hold your breath.

What is the value of innodb_buffer_pool_size? How much RAM?
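A quick way to check the first one from SQL (a standard server variable; nothing specific to your setup):

-- current buffer pool size, in MiB
SELECT @@innodb_buffer_pool_size / 1024 / 1024 AS innodb_buffer_pool_mb;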