MySQL ORDER BY columns across multiple joined tables

indexjoin;MySQLorder-by

I have a database which consists of three tables, with the following structure:

restaurant table: restaurant_id, location_id, rating. Example: 1325, 77, 4.5

restaurant_name table: restaurant_id, language, name. Example: 1325, 'en', 'Pizza Express'

location_name table: location_id, language, name. Example: 77, 'en', 'New York'

I would like to get the restaurant info in English, sorted by location name and restaurant name, and use the LIMIT clause to paginate the result. So my SQL is:

SELECT ln.name, rn.name
FROM restaurant r
INNER JOIN location_name ln
ON r.location_id = ln.location_id
AND ln.language = 'en'
INNER JOIN restaurant_name rn
ON r.restaurant_id = rn.restaurant_id
AND rn.language = 'en'
ORDER BY ln.name, rn.name
LIMIT 0, 50

This is terribly slow – so I refined my SQL with deferred JOIN, which make things a lot faster (from over 10 seconds to 2 seconds):

SELECT ln.name, rn.name
FROM restaurant r
INNER JOIN (
    SELECT r.restaurant_id
    FROM restaurant r
    INNER JOIN location_name ln
    ON r.location_id = ln.location_id
    AND ln.language = 'en'
    INNER JOIN restaurant_name rn
    ON r.restaurant_id = rn.restaurant_id
    AND rn.language = 'en'
    ORDER BY ln.name, rn.name
    LIMIT 0, 50
) r1
ON r.restaurant_id = r1.restaurant_id
INNER JOIN location_name ln
ON r.location_id = ln.location_id
AND ln.language = 'en'
INNER JOIN restaurant_name rn
ON r.restaurant_id = rn.restaurant_id
AND rn.language = 'en'
ORDER BY ln.name, rn.name

2 seconds is unfortunately still not very acceptable to the user, so I go and check the EXPLAIN of the my query, and it appears that the slow part is on the ORDER BY clause, which I see "Using temporary; Using filesort". I checked the official reference manual about ORDER BY optimization and I come across this statement:

In some cases, MySQL cannot use indexes to resolve the ORDER BY,
although it may still use indexes to find the rows that match the
WHERE clause. Examples:

The query joins many tables, and the columns in the ORDER BY are not
all from the first nonconstant table that is used to retrieve rows.
(This is the first table in the EXPLAIN output that does not have a
const join type.)

So for my case, given that the two columns I'm ordering by are from the nonconstant joined tables, index cannot be used. My question is, is there any other approach I can take to speed things up, or what I've done so far is already the best I can achieve? Must I move the columns I wanna sort back to the primary table? (But my site actually offers multiple ways to sort the data, so I'll have to eventually move 6 to 7 columns, causing a lot of data redundancy…)


Below are the DDL of the table. I built them for illustrating this problem only, the real table has much more columns.

CREATE TABLE restaurant (
  restaurant_id INT NOT NULL AUTO_INCREMENT,
  location_id INT NOT NULL,
  rating INT NOT NULL,
  PRIMARY KEY (restaurant_id),
  INDEX idx_restaurant_1 (location_id)
);

CREATE TABLE restaurant_name (
  restaurant_id INT NOT NULL,
  language VARCHAR(255) NOT NULL,
  name VARCHAR(255) NOT NULL,
  INDEX idx_restaurant_name_1 (restaurant_id, language),
  INDEX idx_restaurant_name_2 (name)
);

CREATE TABLE location_name (
  location_id INT NOT NULL,
  language VARCHAR(255) NOT NULL,
  name VARCHAR(255) NOT NULL,
  INDEX idx_location_name_1 (location_id, language),
  INDEX idx_location_name_2 (name)
);

Below is the EXPLAIN output with the ORDER BY clause:

+----+-------------+------------+--------+--------------------------+-----------------------+---------+--------------------------------+------+----------------------------------------------+
| id | select_type | table      | type   | possible_keys            | key                   | key_len | ref                            | rows | Extra                                        |
+----+-------------+------------+--------+--------------------------+-----------------------+---------+--------------------------------+------+----------------------------------------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL                     | NULL                  | NULL    | NULL                           |   50 |                                              |
|  1 | PRIMARY     | rn         | ref    | idx_restaurant_name_1    | idx_restaurant_name_1 | 1538    | r1.restaurant_id,const,const   |    1 | Using where                                  |
|  1 | PRIMARY     | r          | eq_ref | PRIMARY,idx_restaurant_1 | PRIMARY               | 4       | r1.restaurant_id               |    1 |                                              |
|  1 | PRIMARY     | ln         | ref    | idx_location_name_1      | idx_location_name_1   | 1538    | test.r.location_id,const,const |    1 | Using where                                  |
|  2 | DERIVED     | rn         | ALL    | idx_restaurant_name_1    | NULL                  | NULL    | NULL                           | 8484 | Using where; Using temporary; Using filesort |
|  2 | DERIVED     | r          | eq_ref | PRIMARY,idx_restaurant_1 | PRIMARY               | 4       | test.rn.restaurant_id          |    1 |                                              |
|  2 | DERIVED     | ln         | ref    | idx_location_name_1      | idx_location_name_1   | 1538    | test.r.location_id             |    1 | Using where                                  |
+----+-------------+------------+--------+--------------------------+-----------------------+---------+--------------------------------+------+----------------------------------------------+

Below is the EXPLAIN output without the ORDER BY clause:

+----+-------------+------------+--------+--------------------------+-----------------------+---------+--------------------------------+------+--------------------------+
| id | select_type | table      | type   | possible_keys            | key                   | key_len | ref                            | rows | Extra                    |
+----+-------------+------------+--------+--------------------------+-----------------------+---------+--------------------------------+------+--------------------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL                     | NULL                  | NULL    | NULL                           |   50 |                          |
|  1 | PRIMARY     | rn         | ref    | idx_restaurant_name_1    | idx_restaurant_name_1 | 1538    | r1.restaurant_id,const,const   |    1 | Using where              |
|  1 | PRIMARY     | r          | eq_ref | PRIMARY,idx_restaurant_1 | PRIMARY               | 4       | r1.restaurant_id               |    1 |                          |
|  1 | PRIMARY     | ln         | ref    | idx_location_name_1      | idx_location_name_1   | 1538    | test.r.location_id,const,const |    1 | Using where              |
|  2 | DERIVED     | rn         | index  | idx_restaurant_name_1    | idx_restaurant_name_1 | 1538    | NULL                           | 8484 | Using where; Using index |
|  2 | DERIVED     | r          | eq_ref | PRIMARY,idx_restaurant_1 | PRIMARY               | 4       | test.rn.restaurant_id          |    1 |                          |
|  2 | DERIVED     | ln         | ref    | idx_location_name_1      | idx_location_name_1   | 1538    | test.r.location_id             |    1 | Using where; Using index |
+----+-------------+------------+--------+--------------------------+-----------------------+---------+--------------------------------+------+--------------------------+

Thanks in advance for your help!

Best Answer

I see "over-normalization" as the main problem.

The focus for this query seems to be on "language", yet that is the last thing checked.

For one attempt at a solution, let's rearrange the schema to these two tables:

CREATE TABLE restaurant_attributes (
  restaurant_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  rating INT NOT NULL,
  PRIMARY KEY (restaurant_id),
);

CREATE TABLE restaurants_by_lang (
  restaurant_id INT UNSIGNED NOT NULL,
  language VARCHAR(5) NOT NULL CHARACTER SET ascii,  -- see note
  name VARCHAR(255) NOT NULL,
  location VARCHAR(255) NOT NULL,
  PRIMARY KEY(language, restaurant_id),
  INDEX (language, location, name),  -- perfect for the query
  INDEX (name),
  INDEX (location)
);

Now the query is simply:

SELECT location, name
    FROM restaurants_by_lang
    WHERE language = 'en'
    ORDER BY location, name
    LIMIT 0, 50;

This schema will even allow for efficient "pagination", which will be your next problem. (See here.)

I will suggest that if you have a million restaurants, listing the first 50 by location is a next-to-useless UI design. I recommend re-thinking the need for the query. For example, instead of paginating through the entire list, I suggest some form of drill-down such as country to state to city/area to restaurants. It will be a lot faster for the user to find Zahir's in Zimbabwe.

But... Since I can't see the rest of your SELECTs, I don't know what other things I may have made worse for some of them.

I picked a 5-char language based on a standard for such. Do not use 255 unless you really need such.

Testing

When testing with less data than you will eventually have, this technique is often handy...

FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';

Then look at the numbers. A number that approximates the number of rows in the table (or some multiple thereof) indicates a table scan(s). Some number that approximates the LIMIT value indicates that the query could whittle down the task efficiently.

I claim you will see '50' with my approach, and at least 7K (20K?) from your approach.

Note: Page 4 (with LIMIT 150, 50) would say 200 for mine; yours will not change. With the technique in my pagination link, even page 4 will be only 50.

Each temp table (due to GROUP BY and/or ORDER BY) will show Handler_write indicating the number of rows (times the number of tmp tables).