MySQL – How to fix slow LIMIT queries with big offsets without heavily rewriting them

Tags: mysql, performance, query-performance

Imagine a very basic example of your average discussion board. For example:

CREATE TABLE threads (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    title VARCHAR(100),
    PRIMARY KEY (id)
);

CREATE TABLE replies (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    thread_id INT UNSIGNED NOT NULL,
    text TEXT NOT NULL,
    PRIMARY KEY (id),
    INDEX thread_id (thread_id)
);

There can be very lengthy discussions (people love bikeshedding!), maybe 100k or 200k replies per thread, and users read them paginated. The number of replies per page varies with user preferences, but it can be limited to a fixed set of values if the solution needs that. The tables might hold ~40 million replies and ~2 million threads.

So you might end up running this query to get the last replies of a thread:

SELECT * FROM replies
WHERE thread_id = 1234
ORDER BY id ASC
LIMIT 125400,10 /* whoops */

As you know, this is quite slow, since MySQL has to walk past 125,400 rows just to reach the 10 rows it returns.

Hacky solutions I've thought of:

  1. Add a new field to the replies table that holds an incrementing number for each chunk of N posts (for example, 1 for the first 1000 posts, 2 for the following 1000, etc.) and put a secondary index on it.

    • I'd have to heavily modify the application, since tons of queries read the replies table. It's not just a simple SELECT here and there, and I really don't want to cripple and reengineer every query in the application.
    • It would force me to recalculate the numbers every time I delete a reply or do other destructive operations (splitting, merging, etc.).
  2. For each link to the next page, attach the ID of the next post. That way the database can go directly to the row using the primary key of the replies table.

    • This is a web application, so this solution would have tricky SEO implications which I'd prefer not to deal with.
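
For reference, option 2 is what is usually called keyset (or seek) pagination. A minimal sketch, assuming the "next page" link carries the id of the last reply already shown (4821977 is a made-up value):

```sql
-- Sketch of option 2: continue right after the last reply the user saw.
-- 4821977 is a hypothetical id taken from the "next page" link.
SELECT * FROM replies
WHERE thread_id = 1234
  AND id > 4821977
ORDER BY id ASC
LIMIT 10;
```

Assuming InnoDB, the existing thread_id index also stores the primary key, so this should be able to seek straight to the starting row. The cost is that it can only step from a known reply, not jump to an arbitrary page number.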

I might be dreaming here (and if so please do tell me!) but is there a solution that resides (almost) exclusively in the database and allows me to fix this problem without heavily modifying the application?

I've read about MySQL partitions, but I'm not sure they would help here.

Best Answer

You can add a new column to replies, call it position, and fill it with consecutive numbers of replies per thread (the position of the reply in the thread).

For example

id | thread_id | text | position
 1 |         1 | .... | 1
 2 |         2 | .... | 1
 3 |         1 | .... | 2
 4 |         1 | .... | 3
 5 |         2 | .... | 2
 6 |         3 | .... | 1

Then put an index on (thread_id, position, id), which lets you write queries like

SELECT * FROM replies
WHERE thread_id = 1234
AND position BETWEEN 125401 AND 125410 /* positions start at 1, so this matches LIMIT 125400,10 */
ORDER BY position ASC

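This runs fast because MySQL can seek directly to the matching range in the index and read only those rows, instead of scanning and discarding everything before the offset.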
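
A rough sketch of how the column and index could be set up for the tables in the question. It assumes MySQL 8.0 or later (for ROW_NUMBER()), assumes replies should be numbered in id order within each thread, and uses a made-up index name. On a table with ~40 million rows the backfill would probably need to run in batches (for example, per range of thread_id) rather than as one statement.

```sql
-- Assumes MySQL 8.0+ (ROW_NUMBER() is a window function) and that replies
-- are numbered in id order within each thread.

-- 1. Add the column; 0 means "not numbered yet".
ALTER TABLE replies
    ADD COLUMN position INT UNSIGNED NOT NULL DEFAULT 0;

-- 2. Backfill: number the replies of every thread 1, 2, 3, ...
UPDATE replies AS r
JOIN (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY thread_id ORDER BY id) AS pos
    FROM replies
) AS ranked ON ranked.id = r.id
SET r.position = ranked.pos;

-- 3. Index used by the paginated query (the index name is hypothetical).
ALTER TABLE replies
    ADD INDEX thread_position (thread_id, position, id);

-- Page 12541 at 10 replies per page covers positions 125401..125410.
SET @page = 12541, @per_page = 10;
SELECT * FROM replies
WHERE thread_id = 1234
  AND position BETWEEN (@page - 1) * @per_page + 1 AND @page * @per_page
ORDER BY position ASC;
```

Because the page size only enters through the arithmetic in the last query, a variable number of replies per page needs no extra columns.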

You can either update this column in your application, or write a database trigger to do this automatically.
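
One possible shape for such a trigger, as a sketch only. The trigger name is invented, and it assumes the position column and the (thread_id, position, id) index already exist, so the MAX() lookup is cheap.

```sql
-- Hypothetical trigger: give each new reply the next position in its thread.
-- Assumes replies.position exists and is indexed as (thread_id, position, id).
CREATE TRIGGER replies_set_position
BEFORE INSERT ON replies
FOR EACH ROW
SET NEW.position = (
    SELECT COALESCE(MAX(position), 0) + 1   -- 1 for the first reply of a thread
    FROM replies
    WHERE thread_id = NEW.thread_id
);
```

As written this is not safe when two sessions insert into the same thread at the same moment, since both can read the same maximum. A unique key on (thread_id, position) would at least surface such collisions. Deletes, splits and merges are not covered here and would still need their own renumbering.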

The initial effort is quite high, I admit. We used this trick a few years ago on a large, write-heavy table, and as I said, it took some effort to get it running, but once the solution was in place, the performance gain was overwhelming.