The table has a secondary index on name
(let's call it idx_name for the reference). Internally MySQL stores records in secondary indexes as a pair of (key, value), where key is indexed field and value is the primary key. In this case it will be (name, id).
To execute the first query MySQL decided to use index idx_name
. But in order to satisfy the query it has to take id
for each value of name
and go to PRIMARY index in order to get address
and phone
values. That's why they call it "additional lookup".
For the second query MySQL has all necessary fields in index idx_name
. Remember, id is the part of the index?
Putting the index on (skill_name, count)
rather than just (skill_name)
would help a lot. That way it could directly walk the index to get the 1000 highest counts within the 'java' skill, rather than needing to do a sort and throw away 99.8% of the results.
But, that doesn't explain the situation, it just circumvents it, so I'll ignore that solution for the sake of providing an explanation.
The faster table is well clustered. It needs to visit 228547 rows, but only needs to visit 1889 pages to do so. For each page it reads, it find 120 rows which meet the condition of skill1 = 'java'. (120 rows is about as many as you can fit onto a page, so those pages are packed full of nothing but 'java' entries)
The slower table is not well clustered on that column. To visit 627761 rows, it needs to visit 605929 pages, or just under 1.04 relevant rows per page.
Worse, many of these pages are "lossy", which certainly slows things down. Increasing work_mem could help there. Since the faster query visits 3 times fewer rows, it doesn't over-flow work_mem the same way as the slower one does. If it were the other way around and the one that was well-clustered was overflowing work_mem, the situation would be much better. When it overflows, you have to inspect every row in the page to recheck the condition. If 120 rows per page satisfy that recheck, this is no big deal. When only 1.04 rows per page satisfy the rcheck, that is a lot of extra work for nothing.
If increasing work_mem is not enough, you could also cluster the table on title_skill_skill_idx
, which would put all the rows with the skill of 'java' next to each other in the table, like they are in the faster case. Of course if the table is already well-clustered on another column (title
, for example) then changing this clustering could improve one query only to make a different ones worse.
Best Answer
The book is assuming that PersonFriend is indexed on PersonID, but not on FriendID. It also seems to assume that Person indexes PersonID and Person independently.
If this is the case, the first query comes back as
Then the second comes back as
This would be an elementary relational database design error. With a relationship table like this you would always index all keys, and the performance profile for both queries would be indistinguishable (ignoring variations in value distributions, etc).
At this point it sounds like the author is trying to show that relational databases are intrinsically deficient.