Mysql – Must an index cover all selected columns for it to be used for ORDER BY

indexinnodbMySQLorder-by

Over at SO, someone recently asked Why isn't ORDER BY using the index?

The situation involved a simple InnoDB table in MySQL comprising three columns and 10k rows. One of the columns, an integer, was indexed—and the OP sought to retrieve his entire table sorted on that column:

SELECT * FROM person ORDER BY age

He attached EXPLAIN output showing that this query was resolved with a filesort (rather than the index) and asked why that would be.

Despite the hint FORCE INDEX FOR ORDER BY (age) causing the index to be used, someone answered (with supporting comments/upvotes from others) that an index is only used for sorting when the selected columns are all read from the index (i.e. as would normally be indicated by Using index in the Extra column of EXPLAIN output). An explanation was later given that traversing the index and then fetching columns from the table results in random I/O, which MySQL views as more expensive than a filesort.

This appears to fly in the face of the manual chapter on ORDER BY Optimization, which not only conveys the strong impression that satisfying ORDER BY from an index is preferable to performing additional sorting (indeed, filesort is a combination of quicksort and mergesort and therefore must have a lower bound of Ω(nlog n); whilst walking through the index in order and seeking into the table ought to be O(n)—so this makes perfect sense), but it also neglects to mention this alleged "optimisation" whilst also stating:

The following queries use the index to resolve the ORDER BY part:

SELECT * FROM t1
  ORDER BY key_part1,key_part2,... ;

To my reading, that is precisely the case in this situation (yet the index was not being used without an explicit hint).

My questions are:

  • Is it indeed necessary for all selected columns to be indexed in order for MySQL to choose to use the index?

    • If so, where is this documented (if at all)?

    • If not, what was going on here?

Best Answer

Is it indeed necessary for all selected columns to be indexed in order for MySQL to choose to use the index?

This is a loaded question because there are factors that determine whether an index is worth using.

FACTOR #1

For any given index, what is the key population? In other words, what is the cardinality (distinct count) of all tuples recorded in the index?

FACTOR #2

What storage engine are you using? Are all needed columns accessible from an index?

WHAT'S NEXT ???

Let's take a simple example: a table that holds two values (Male and Female)

Let create such a table with a test for index usage

USE test
DROP TABLE IF EXISTS mf;
CREATE TABLE mf
(
    id int not null auto_increment,
    gender char(1),
    primary key (id),
    key (gender)
) ENGINE=InnODB;
INSERT INTO mf (gender) VALUES
('M'),('M'),('M'),('M'),('M'),('M'),('M'),('M'),
('M'),('M'),('M'),('M'),('F'),('F'),('M'),('M'),
('M'),('M'),('M'),('M'),('M'),('M'),('M'),('M'),
('M'),('M'),('M'),('M'),('M'),('M'),('M'),('M'),
('F'),('M'),('M'),('M'),('M'),('M'),('M'),('M');
ANALYZE TABLE mf;
EXPLAIN SELECT gender FROM mf WHERE gender='F';
EXPLAIN SELECT gender FROM mf WHERE gender='M';
EXPLAIN SELECT id FROM mf WHERE gender='F';
EXPLAIN SELECT id FROM mf WHERE gender='M';

TEST InnoDB

mysql> USE test
Database changed
mysql> DROP TABLE IF EXISTS mf;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE mf
    -> (
    ->     id int not null auto_increment,
    ->     gender char(1),
    ->     primary key (id),
    ->     key (gender)
    -> ) ENGINE=InnoDB;
Query OK, 0 rows affected (0.07 sec)

mysql> INSERT INTO mf (gender) VALUES
    -> ('M'),('M'),('M'),('M'),('M'),('M'),('M'),('M'),
    -> ('M'),('M'),('M'),('M'),('F'),('F'),('M'),('M'),
    -> ('M'),('M'),('M'),('M'),('M'),('M'),('M'),('M'),
    -> ('M'),('M'),('M'),('M'),('M'),('M'),('M'),('M'),
    -> ('F'),('M'),('M'),('M'),('M'),('M'),('M'),('M');
Query OK, 40 rows affected (0.06 sec)
Records: 40  Duplicates: 0  Warnings: 0

mysql> ANALYZE TABLE mf;
+---------+---------+----------+----------+
| Table   | Op      | Msg_type | Msg_text |
+---------+---------+----------+----------+
| test.mf | analyze | status   | OK       |
+---------+---------+----------+----------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT gender FROM mf WHERE gender='F';
+----+-------------+-------+------+---------------+--------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key    | key_len | ref   | rows | Extra                    |
+----+-------------+-------+------+---------------+--------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | mf    | ref  | gender        | gender | 2       | const |    3 | Using where; Using index |
+----+-------------+-------+------+---------------+--------+---------+-------+------+--------------------------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT gender FROM mf WHERE gender='M';
+----+-------------+-------+------+---------------+--------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key    | key_len | ref   | rows | Extra                    |
+----+-------------+-------+------+---------------+--------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | mf    | ref  | gender        | gender | 2       | const |   37 | Using where; Using index |
+----+-------------+-------+------+---------------+--------+---------+-------+------+--------------------------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT id FROM mf WHERE gender='F';
+----+-------------+-------+------+---------------+--------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key    | key_len | ref   | rows | Extra                    |
+----+-------------+-------+------+---------------+--------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | mf    | ref  | gender        | gender | 2       | const |    3 | Using where; Using index |
+----+-------------+-------+------+---------------+--------+---------+-------+------+--------------------------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT id FROM mf WHERE gender='M';
+----+-------------+-------+------+---------------+--------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key    | key_len | ref   | rows | Extra                    |
+----+-------------+-------+------+---------------+--------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | mf    | ref  | gender        | gender | 2       | const |   37 | Using where; Using index |
+----+-------------+-------+------+---------------+--------+---------+-------+------+--------------------------+
1 row in set (0.00 sec)

mysql>

TEST MyISAM

mysql> USE test
Database changed
mysql> DROP TABLE IF EXISTS mf;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE mf
    -> (
    ->     id int not null auto_increment,
    ->     gender char(1),
    ->     primary key (id),
    ->     key (gender)
    -> ) ENGINE=MyISAM;
Query OK, 0 rows affected (0.05 sec)

mysql> INSERT INTO mf (gender) VALUES
    -> ('M'),('M'),('M'),('M'),('M'),('M'),('M'),('M'),
    -> ('M'),('M'),('M'),('M'),('F'),('F'),('M'),('M'),
    -> ('M'),('M'),('M'),('M'),('M'),('M'),('M'),('M'),
    -> ('M'),('M'),('M'),('M'),('M'),('M'),('M'),('M'),
    -> ('F'),('M'),('M'),('M'),('M'),('M'),('M'),('M');
Query OK, 40 rows affected (0.00 sec)
Records: 40  Duplicates: 0  Warnings: 0

mysql> ANALYZE TABLE mf;
+---------+---------+----------+----------+
| Table   | Op      | Msg_type | Msg_text |
+---------+---------+----------+----------+
| test.mf | analyze | status   | OK       |
+---------+---------+----------+----------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT gender FROM mf WHERE gender='F';
+----+-------------+-------+------+---------------+--------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key    | key_len | ref   | rows | Extra                    |
+----+-------------+-------+------+---------------+--------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | mf    | ref  | gender        | gender | 2       | const |    3 | Using where; Using index |
+----+-------------+-------+------+---------------+--------+---------+-------+------+--------------------------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT gender FROM mf WHERE gender='M';
+----+-------------+-------+------+---------------+--------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key    | key_len | ref   | rows | Extra                    |
+----+-------------+-------+------+---------------+--------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | mf    | ref  | gender        | gender | 2       | const |   36 | Using where; Using index |
+----+-------------+-------+------+---------------+--------+---------+-------+------+--------------------------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT id FROM mf WHERE gender='F';
+----+-------------+-------+------+---------------+--------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key    | key_len | ref   | rows | Extra       |
+----+-------------+-------+------+---------------+--------+---------+-------+------+-------------+
|  1 | SIMPLE      | mf    | ref  | gender        | gender | 2       | const |    3 | Using where |
+----+-------------+-------+------+---------------+--------+---------+-------+------+-------------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT id FROM mf WHERE gender='M';
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | mf    | ALL  | gender        | NULL | NULL    | NULL |   40 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)

mysql>

Analysis for InnoDB

When the data was loaded as InnoDB, please note that all four EXPLAIN plans used the gender index. The third and fourth EXPLAIN plans used the gender index even though the requested data was id. Why? Because id is in the PRIMARY KEY and all secondary indexes have reference pointers back to the PRIMARY KEY (via the gen_clust_index).

Analysis for MyISAM

When the data was loaded as MyISAM, please note that the first three EXPLAIN plans used the gender index. In the fourth EXPLAIN plan, the Query Optimizer decided not to use an index at all. It opted for a full table scan instead. Why?

Regardless of DBMS, Query Optimizers operate on a very simple rule-of-thumb: If an index is being screened as a candidate to be used for performing the lookup and Query Optimizer computes that it must lookup more than 5% of the total number of rows in the table:

  • a full index scan is done if all needed columns for retrieval are in the selected index
  • a full table scan otherwise

CONCLUSION

If you do not have proper covering indexes, or if the key population for any given tuple is more than 5% of the table, six things must happen:

  1. Come to the realization that you must profile the queries
  2. Find all WHERE, GROUP BY, and ORDER BY` clauses from those Queries
  3. Formulate indexes in this order
    • WHERE clause columns with static values
    • GROUP BY columns
    • ORDER BY columns
  4. Avoid Full Table Scans (Queries lacking a sensible WHERE clause)
  5. Avoid Bad Key Populations (or at least cache those Bad Key Populations)
  6. Decide on the best MySQL Storage Engine (InnoDB or MyISAM) for the Tables

I have written about this 5% rule of thumb in the past:

UPDATE 2012-11-14 13:05 EDT

I took a look back at your question and at the original SO post. Then, I thought about my Analysis for InnoDB I mentioned before. It coincides with the person table. Why?

For both tables mf and person

  • Storage Engine is InnoDB
  • Primary Key is id
  • Table access is by secondary index
  • If table was MyISAM, we would see a completely different EXPLAIN plan

Now, look at the query from the SO question : select * from person order by age\G. Since there is no WHERE clause, you explicitly demanded a full table scan. The default sort order of the table would be by id (PRIMARY KEY) because of its auto_increment and the gen_clust_index (aka Clustered Index) is ordered by internal rowid. When you ordered by the index, keep in mind that InnoDB secondary indexes have the rowid attached to each index entry. This produces the internal need for full row access each time.

Setting up ORDER BY on an InnoDB table can be a rather daunting task if you ignore these facts about how InnoDB indexes are organized.

Going back to that SO query, since you explicitly demanded a full table scan, IMHO the MySQL Query Optimizer did the correct thing (or at least, chose the path of least resistance). When it comes to InnoDB and the SO query, it is far easier to perform a full table scan and then some filesort rather than doing a full index scan and a row lookup via the gen_clust_index for each secondary index entry.

I am not an advocate of using Index Hints because it ignores the EXPLAIN plan. Notwithstanding, if you really know your data better than InnoDB, you will have to resort to Index Hints, especially with queries that have no WHERE clause.

UPDATE 2012-11-14 14:21 EDT

According to the book Understanding MySQL Internals

enter image description here

Page 202 Paragraph 7 says the following:

The data is stored in a special structure called a clustered index, which is a B-tree with the primary key acting as the key value, and the actual record (rather than a pointer) in the data part. Thus, each InnoDB table must have a primary key. If one is not supplied, a special row ID column not normally visible to the user is added to act as a primary key. A secondary key will store the value of the primary key that identifies the record. The B-tree code can be found in innobase/btr/btr0btr.c.

This is why I stated earlier : it is far easier to perform a full table scan and then some filesort rather than doing a full index scan and a row lookup via the gen_clust_index for each secondary index entry. InnoDB is going to do a double index lookup every time. That sounds kind of brutal, but that's just the facts. Again, take into consideration the lack of WHERE clause. This, in itself, is the hint to the MySQL Query Optimizer to do a full table scan.