MySQL – Left Join takes too long, how to optimize query

Tags: innodb, join, MySQL, select

A leader may have many followers. The notification_followers table gets a single notification row when a leader adds a post, with leader_id 1 and notifiable_id 0 (ids 1 and 2 in the table below). The same table gets a single notification row when the current user 14 is followed by someone, with leader_id 0 and notifiable_id 14 (id 3 in the table below).

notification_followers (id is PRIMARY, each field except data is an index on its own)

| id | uuid               | leader_id | notifiable_id | data   | created_at
------------------------------------------------------------------------------------
| 1  | 001w2cwfoqzp8F3... | 1         | 0             | Post A | 2018-04-19 00:00:00
| 2  | lvbuX4d5qCHJUIN... | 1         | 0             | Post B | 2018-04-20 00:00:00
| 3  | eEq5r5g5jApkKgd... | 0         | 14            | Follow | 2018-04-21 00:00:00

All the follower-related notifications are now in one place, which is perfect.

We now need to check whether user 14 is a follower of leader_id 1 to know whether to show them notifications 1 and 2. For that, we scan the user_follows table to see if the logged-in user exists as a follower_id for that leader_id. They should see a notification only if they followed the leader before it was posted (new followers should not get older post notifications when they follow the user, only new ones).

user_follows (id is PRIMARY, each field is an index on its own)

| id | leader_id | follower_id | created_at
----------------------------------------------------
| 1  | 1         | 14         |  2018-04-18 00:00:00 // followed before, has notifs
| 2  | 1         | 15         |  2018-04-22 00:00:00 // followed after, no notifs

The final thing to note is that the user should know whether a notification was read or not. This is where the notification_followers_read table comes in. It stores the follower_id along with the notification_uuid for all read notifications, along with their read_at timestamp.

notification_followers_read (composite index on notification_uuid, follower_id)

| notification_uuid | follower_id | read_at
--------------------------------------------------------
  qIXE97AP49muZf... | 17          | 2018-04-21 00:00:00 // not for 14, we ignore it

We now want to return the latest 10 notifications for user 14, ordered by the auto-incrementing nf.id descending. They should see all 3 notifications from notification_followers, since none of them have been read by this user yet: the first 2 because they followed the leader before the leader made the posts, and the 3rd because they were followed and its notifiable_id is 14.

Here is the query, which works but takes too long (~9 secs):

SELECT nf.id, nf.uuid, nf.leader_id, nf.data, nf.created_at, nfr.read_at
FROM notification_followers nf
LEFT JOIN user_follows uf ON uf.leader_id = nf.leader_id AND uf.follower_id = 14
LEFT JOIN notification_followers_read nfr ON nf.uuid = nfr.notification_uuid AND nfr.follower_id = 14
WHERE (nf.created_at > uf.created_at OR notifiable_id = 14)
ORDER BY nf.id DESC LIMIT 10

notification_followers has ~100K records and we're using InnoDB. Here is the EXPLAIN for the query:

(EXPLAIN output was posted as an image and is not reproduced here.)

How can we optimize the query so it runs in a few ms?

UPDATE WITH SQL DUMP

SQL dump to reproduce locally: just create a speed_test database locally and import the file to see the slow query issue live with all the table data (~100K rows).

Best Answer

Summary from the comments:

So far I'm getting the best result with

CREATE INDEX nfr_fid_nuuid
             ON notification_followers_read
                (follower_id,
                 notification_uuid);

and

CREATE INDEX uf_fid_lid
             ON user_follows
                (follower_id,
                 leader_id);

and all the other indexes, except the primary ones, dropped. For notification_followers, the optimizer used the PRIMARY index in my tests. I couldn't find anything better than PRIMARY for this table so far.

Tests were done on a MySQL v5.7.21 32 bit on Windows 7 32 bit.

Execution times were about 4 secs without and 0.2 secs with the indexes stated above.

Some lines on how, why and whatever: (haven't had the space for that in the comments)

(Disclaimer: My knowledge on that shouldn't be all bad overall. Though, in some aspects my understanding might be improvable or just plainly wrong. Anyone feel free to correct me if I'm wrong somewhere -- edits or comments welcome.)

A general thing about joins in terms of performance:

As already mentioned in the comments, one goal with joins is to keep the joined sets as small as possible, as early as possible. For illustration: in the worst case, when a nested loop join must be applied, an A JOIN B needs #A*#B comparison operations (let #A be the number of rows in A, and analogously for B). So any row from A (or B) that can be ruled out before the actual join operation is applied will reduce the number of operations not only by 1 but by #B (or #A). That is what one wants in terms of performance.

If a join can be done via an index, especially in a way that lets the DBMS easily locate the portion of the index relevant for the join (i.e. keeping the set small), that can be a huge boost. Of course there are other advantages an index can provide here (e.g. the rows are already accessible in sorted order, supporting more efficient join methods; the index might be significantly smaller and fit largely into memory, reducing the need for constant disk I/O; ...).

But all that is a topic of its own, so this is just meant as a rough outline.
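To make the #A*#B argument concrete, here is a minimal Python sketch of a naive nested loop join that counts its comparisons. It is only an illustration of the counting argument, not a model of how MySQL executes joins; the toy tables `A` and `B` and the helper `nested_loop_join` are made up for this example.

```python
def nested_loop_join(a_rows, b_rows, match):
    """Naive nested loop join; returns (joined pairs, number of comparisons)."""
    comparisons = 0
    joined = []
    for a in a_rows:
        for b in b_rows:
            comparisons += 1
            if match(a, b):
                joined.append((a, b))
    return joined, comparisons


# Hypothetical toy data: A holds (leader_id,), B holds (leader_id, follower_id).
A = [(leader_id,) for leader_id in range(100)]
B = [(leader_id, follower_id) for leader_id in range(100) for follower_id in (14, 15)]


def same_leader(a, b):
    return a[0] == b[0]


# Join first, filter afterwards: every row of B takes part in the join.
pairs_all, cmp_all = nested_loop_join(A, B, same_leader)
late = [(a, b) for a, b in pairs_all if b[1] == 14]

# Filter B down to follower 14 first, then join the smaller set.
B_14 = [b for b in B if b[1] == 14]
early, cmp_early = nested_loop_join(A, B_14, same_leader)

print(cmp_all, cmp_early)  # 20000 10000
```

Both variants produce the same joined rows, but ruling out the 100 rows of follower 15 before the join saves 100 comparisons per removed row, i.e. #A each.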

On the query the question is about:

Now the first thing to notice about the query: it is a LEFT OUTER JOIN (OK, actually it's two, but that doesn't matter for this thought). notification_followers is the left table here, so its set of records won't get reduced by the joins; only the WHERE can do that.

The WHERE unfortunately is an OR. These are difficult and "bad" in contrast to AND. An OR is more like a union, keeping the set large, than like an intersection, which reduces the set's cardinality. (Compare: for A OR B the result set is all rows WHERE A UNION all rows WHERE B, in contrast to A AND B, for which the result set is all rows WHERE A INTERSECT all rows WHERE B.)

So the WHERE isn't a very promising candidate for being answered from one index alone in a single run. Furthermore, one of the ORed operations from the WHERE (nf.created_at > uf.created_at) depends on the joined data, so that one can only be applied after (or at best while) the join.

There is also the ORDER BY, which can be expensive, especially when the result set is too large to be kept in memory. It then needs to be sorted while constantly writing to and reading from the disk (which serves as a larger buffer). And disk access takes a lot of time.
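The union/intersection comparison can be checked on the three sample notifications from the question. This is a small Python sketch using plain sets, not SQL; the two predicate functions and the `followed_at` lookup are hypothetical helpers that mimic the two ORed conditions for user 14.

```python
# Sample rows from the question: (id, leader_id, notifiable_id, created_at)
notifications = [
    (1, 1, 0, "2018-04-19 00:00:00"),
    (2, 1, 0, "2018-04-20 00:00:00"),
    (3, 0, 14, "2018-04-21 00:00:00"),
]

# User 14 follows leader 1 since this timestamp (row 1 of user_follows).
followed_at = {1: "2018-04-18 00:00:00"}


def posted_after_follow(row):
    """Condition A: nf.created_at > uf.created_at (needs the joined row)."""
    _, leader_id, _, created_at = row
    # ISO timestamps in the same format compare correctly as strings.
    return leader_id in followed_at and created_at > followed_at[leader_id]


def addressed_to_user(row):
    """Condition B: notifiable_id = 14."""
    return row[2] == 14


rows_a = {r[0] for r in notifications if posted_after_follow(r)}
rows_b = {r[0] for r in notifications if addressed_to_user(r)}
rows_or = {r[0] for r in notifications if posted_after_follow(r) or addressed_to_user(r)}
rows_and = {r[0] for r in notifications if posted_after_follow(r) and addressed_to_user(r)}

print(rows_or == rows_a | rows_b)   # OR behaves like a union
print(rows_and == rows_a & rows_b)  # AND behaves like an intersection
```

On this data the OR keeps all three notifications, while an AND would keep none, which is why the OR gives an index so little to cut away.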

So my hope for notification_followers was to find a compound index that would support the ORDER BY and ideally at least one of the ORed comparisons. As mentioned, I failed on that. But my expectations also weren't too high, given the discussion of that part above.

Or PRIMARY is just good enough for that in the view of the DBMS, which might be fine. As I understand it, tables with a primary key in InnoDB are actually stored as clustered indexes. What I couldn't find (quickly) in the docs was whether the records are also doubly linked in primary key order. That would allow PRIMARY to support the ORDER BY through a reverse traversal of that linked list and make PRIMARY a good choice indeed.

The ONs of the joined tables are rather easy in contrast to the WHERE and the ORDER BY. (As an example I will use the join with user_follows; notification_followers_read is analogous.) Here we have two relevant columns, leader_id and follower_id.
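Why a traversal in primary key order would be attractive for ORDER BY nf.id DESC LIMIT 10 can be sketched in Python. This is an illustration of the general idea only and makes no claim about InnoDB internals; the row layout, the predicate, and both function names are assumptions made up for the example.

```python
def latest_by_full_sort(rows, keep, limit):
    """Filter everything, sort the whole result, then cut it to `limit`."""
    matching = [r for r in rows if keep(r)]
    matching.sort(key=lambda r: r["id"], reverse=True)
    return matching[:limit]


def latest_by_reverse_scan(rows_in_id_order, keep, limit):
    """Walk the rows from the highest id downwards, stop after `limit` matches."""
    out, visited = [], 0
    for r in reversed(rows_in_id_order):
        visited += 1
        if keep(r):
            out.append(r)
            if len(out) == limit:
                break
    return out, visited


# Hypothetical table of 100K rows stored in id order; every third row is for user 14.
rows = [{"id": i, "notifiable_id": 14 if i % 3 == 0 else 0} for i in range(1, 100_001)]


def is_for_14(r):
    return r["notifiable_id"] == 14


sorted_top = latest_by_full_sort(rows, is_for_14, 10)
scanned_top, visited = latest_by_reverse_scan(rows, is_for_14, 10)

print(visited, len(rows))
```

Both functions return the same 10 rows, but the reverse scan touches only a handful of rows and never sorts anything. The benefit shrinks when matching rows are rare, because the scan then has to walk much further before it finds 10 of them.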

follower_id seems to be more suitable as the first column of a compound index. It's compared with a literal, hence independent of the partner rows of the join. The relevant portion of the index, a subtree -- "normal" indexes in MySQL are some B-tree variant -- can thus be (re)used for all join partners. And also note the reduction of the set of possible rows from user_follows here!

Also having leader_id as a column in that index should then make user_follows's part of the join answerable from this index alone. And indeed it worked.

Note that the order of the columns in the statement isn't necessarily the right order for an index on them. Whatever is commutative is reordered by the optimizer if that promises to be better, so the written order won't necessarily be kept anyway. To find a good column order for an index, one must mainly think about which order would partition the index in the most "radical" fashion first (leaving the remaining part as small as possible).
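The effect of the column order can be sketched with a sorted list of key tuples standing in for the B-tree. This is a simplification with made-up toy rows; `follower_slice` and `follows` are hypothetical helpers, not anything MySQL exposes.

```python
from bisect import bisect_left, bisect_right

# Toy user_follows rows: (id, leader_id, follower_id)
user_follows = [(1, 1, 14), (2, 1, 15), (3, 2, 14), (4, 3, 16), (5, 2, 15)]

# Stand-ins for two compound indexes: sorted lists of key tuples.
idx_fid_lid = sorted((f, l) for _, l, f in user_follows)  # (follower_id, leader_id)
idx_lid_fid = sorted((l, f) for _, l, f in user_follows)  # (leader_id, follower_id)


def follower_slice(index, follower_id):
    """Return the contiguous run of entries whose first column is follower_id."""
    lo = bisect_left(index, (follower_id,))
    hi = bisect_right(index, (follower_id, float("inf")))
    return index[lo:hi]


# With follower_id first, all entries for user 14 sit next to each other.
entries_14 = follower_slice(idx_fid_lid, 14)


def follows(leader_id):
    """Probe the small slice once per notification row; no table access needed."""
    i = bisect_left(entries_14, (14, leader_id))
    return i < len(entries_14) and entries_14[i] == (14, leader_id)


# With leader_id first, the entries for user 14 are scattered across the index.
scattered = [i for i, (_, f) in enumerate(idx_lid_fid) if f == 14]

print(entries_14)              # [(14, 1), (14, 2)]
print(scattered)               # [0, 2]
print(follows(1), follows(3))  # True False
```

The slice for follower 14 is located once and then reused for every row of notification_followers, and because leader_id is part of the key, the membership test never has to look at the table rows themselves.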