Things are more complicated than that. Here are a few points of consideration.
First, this entire discussion assumes B-Trees or B+ Trees (hence the O(log n)). There are other types of indexes, such as hash indexes, where access is O(1). Your question suggests you're looking up values with an equality search (e.g. looking for X = 17). In that particular scenario, a hash index is preferable, when possible.
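To make the distinction concrete, here is a minimal Python sketch (illustrative only, nothing like real index code): a dict plays the role of a hash index with O(1) equality lookups, while a sorted list searched with `bisect` stands in for an ordered, B-Tree-like index with O(log n) lookups. Note that only the ordered structure can also answer range queries, which is a big part of why B-Trees dominate in practice.

```python
import bisect

# "Hash index": equality lookups in O(1) on average.
hash_index = {17: "row_a", 42: "row_b", 99: "row_c"}
print(hash_index[17])  # X = 17 -> "row_a"

# "B-Tree-like index": keys kept sorted, searched in O(log n).
keys = [17, 42, 99]
rows = ["row_a", "row_b", "row_c"]
i = bisect.bisect_left(keys, 17)
print(rows[i])  # also "row_a", but found via binary search

# Only the ordered structure supports range scans (e.g. X BETWEEN 20 AND 100):
lo = bisect.bisect_left(keys, 20)
hi = bisect.bisect_right(keys, 100)
print(rows[lo:hi])  # ["row_b", "row_c"]
```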
I do agree though that most indexes you'll find today are B/B+ Trees, so let's continue with this assumption.
You've also implicitly indicated that there's always exactly one resulting row in a SELECT, which is hardly a representative case; plenty of times we look for 1,000 rows at a time. But let's continue with the assumption of a single matching row.
Your next assumption is that searches are always done by the indexed column. This is fine, but I'm just noting that DELETEing a record by some unindexed column Y turns out to be more expensive: you both waste O(n) time finding the record, and then pay an additional O(log n) for updating the index.
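As a toy illustration of that double cost (pure Python, not real storage-engine code): the delete on the unindexed column has to scan every row, and the index on the other column still has to be maintained afterward.

```python
import bisect

# Table rows: (id, y). The index is on id only; y is unindexed.
table = [(1, "apple"), (2, "banana"), (3, "cherry")]
id_index = [1, 2, 3]  # sorted list standing in for a B-Tree on id

def delete_by_y(y_value):
    """DELETE ... WHERE y = y_value, where y is the unindexed column."""
    # O(n): full scan to find the matching row.
    for pos, (row_id, y) in enumerate(table):
        if y == y_value:
            del table[pos]
            # O(log n): the id index must be updated as well.
            id_index.pop(bisect.bisect_left(id_index, row_id))
            return row_id
    return None

delete_by_y("banana")
print(table)     # [(1, "apple"), (3, "cherry")]
print(id_index)  # [1, 3]
```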
But let us continue with the assumption that we're only discussing queries that look up indexed columns.
Some tables use the unclustered index format (which fits your calculations): the table is one entity and the index is another. Others use the clustered index format: table rows (or rather blocks of rows, or rather yet pages of rows) are actually stored as leaves inside the clustering index. In that scenario, you pay O(log n) to find a record's position for an INSERT command. An optimization exists for the case where you're inserting at the end of the table: a decent implementation holds a pointer to the last record/page in the index. (And yes, you should note the possibility that your record gets INSERTed into the middle of the table.)
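A sketch of that fast path in Python (hypothetical, with a sorted list standing in for the clustered index): if the new key is at least as large as the current last key, we append in O(1) and never descend the tree; otherwise the record lands in the middle and we pay the O(log n) search.

```python
import bisect

clustered = []  # keys in clustered-index order

def insert(key):
    # Fast path: inserting at the end of the table needs no search,
    # assuming we keep a handle on the last record/page.
    if not clustered or key >= clustered[-1]:
        clustered.append(key)  # O(1), no O(log n) descent
    else:
        # Record lands in the middle of the table: O(log n) search
        # (plus, in a real B-Tree, a possible page split).
        bisect.insort(clustered, key)

for k in [10, 20, 30, 25, 40]:
    insert(k)
print(clustered)  # [10, 20, 25, 30, 40]
```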
Actually, records can get inserted into the middle of the table even in unclustered tables; it makes sense to spend more search time in order to avoid fragmentation, and at least one implementation I know of does exactly that. I'm assuming others may, too.
Deleting/inserting an index entry in a B-Tree is O(1) on average, but may cost up to O(log n) operations in the case of propagated page merges/splits.
Also, a fact that always comes as a surprise to many: sometimes a table scan is faster than an index lookup. This is particularly true for queries returning multiple rows. Looking up the index adds overhead; compared to a full table scan, that overhead can actually make the total cost higher. For a single-row lookup, the vast majority of index lookups should be faster than a full table scan.
But do consider the following general convention: you pay in dollars for any action that accesses disk, and in nickels for actions on in-memory data. This is at the heart of database disk I/O optimization. If the index pages are on disk while the table pages happen to be in memory, you may well pay less for the table scan.
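Some back-of-the-envelope arithmetic under that convention (all numbers made up for illustration): charge a "dollar" per page read from disk and a "nickel" per page already in memory, then compare a full table scan of a cached table against an index lookup whose pages are all cold.

```python
DISK_PAGE = 1.00   # "dollars": page read from disk
MEM_PAGE  = 0.05   # "nickels": page already in memory

# Hypothetical table: 10,000 pages, all cached in memory.
table_pages = 10_000
scan_cost = table_pages * MEM_PAGE

# Hypothetical index lookup for 2,000 matching rows: each match costs
# an index descent (3 pages) plus one row fetch, all on cold disk pages.
matching_rows = 2_000
index_cost = matching_rows * (3 + 1) * DISK_PAGE

print(f"table scan:   ${scan_cost:,.2f}")   # $500.00
print(f"index lookup: ${index_cost:,.2f}")  # $8,000.00
# With the table in memory and the index on disk, the "slower" O(n)
# scan comes out an order of magnitude cheaper than the O(log n) lookups.
```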
And that's where havoc comes in: it really all depends on your workloads, on your memory size, on your dataset size.
Did you ever take a math class where you had to solve some complex integral? It took you hours to solve, and you got points off for missing some tiny minus sign somewhere?
Did you ever take a physics class where you had to solve some complex integral? The professor would just throw away chunks of the equation, saying "this is negligible", and you would tear your hair out? WHY is it negligible? Why not other things?
Computer science is based on math. Computers are based on physics. They are physical objects. They need to spin disks, access a memory bus, manage billions of transistors... You just can't anticipate everything that will happen and put it all under one equation.
It may just turn out that for some particular dataset your entire equation does not hold water. In other times, it may be just fine.
I have some rather distressing news: ORDER BY can still wreak havoc with filesorts. Despite all the hype about this being addressed and fixed, there is simply no way to get InnoDB to effectively use an index for every ORDER BY.
Start with the Ground Zero of InnoDB row data, the Clustered Index.
Rows are tagged with:
- a 6-byte transaction ID field
- a 7-byte roll pointer field
Rows tend to be stored in whatever order the data was entered. The columns of the PRIMARY KEY are included in secondary indexes and are used to locate rows in the Clustered Index. Unfortunately, the two ID fields are not really used to dictate any ordering of rows within the Clustered Index. (For more info, please see the MySQL Documentation on InnoDB Physical Row Structure.)
Here is something even more disturbing: did you know you can order the rows in a table by the columns of the PRIMARY KEY, or by any arbitrary ordering you choose?
Here is the syntax:
ALTER TABLE tblname ORDER BY col_name [, col_name] ...
This could speed up some queries that rely on PRIMARY KEY ordering, but what's disturbing is that it only applies to MyISAM tables. Why not InnoDB?
According to the MySQL Documentation on `ALTER TABLE ... ORDER BY`:
ORDER BY enables you to create the new table with the rows in a
specific order. Note that the table does not remain in this order
after inserts and deletes. This option is useful primarily when you
know that you are mostly to query the rows in a certain order most of
the time. By using this option after major changes to the table, you
might be able to get higher performance. In some cases, it might make
sorting easier for MySQL if the table is in order by the column that
you want to order it by later.
ORDER BY syntax permits one or more column names to be specified for
sorting, each of which optionally can be followed by ASC or DESC to
indicate ascending or descending sort order, respectively. The default
is ascending order. Only column names are permitted as sort criteria;
arbitrary expressions are not permitted. This clause should be given
last after any other clauses.
ORDER BY does not make sense for InnoDB tables because InnoDB always
orders table rows according to the clustered index.
This comes as no surprise to me, since I mentioned this in one of my earlier posts (Aug 29, 2011 : Preordering the table by a specified column).
Therefore, doing an `ORDER BY` on an InnoDB table never guarantees proper index selection, due to its internal index organization. Thus, one should not be surprised by a filesort on an InnoDB table, no matter what secondary indexes the table has.
Best Answer
This is a loaded question because there are factors that determine whether an index is worth using.
FACTOR #1
For any given index, what is the key population? In other words, what is the cardinality (distinct count) of all tuples recorded in the index?
FACTOR #2
What storage engine are you using? Are all needed columns accessible from an index?
WHAT'S NEXT ???
Let's take a simple example: a table that holds two values (Male and Female)
Let's create such a table and test it for index usage.
TEST InnoDB
TEST MyISAM
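The original MySQL test scripts are not reproduced above. As a rough stand-in, here is a SQLite sketch of the same idea (the table name `mf` and its layout are my guesses, and SQLite's optimizer is not MySQL's): a secondary index on a two-value gender column is used for an equality lookup, and a query asking only for `id` can be satisfied from that same index, because every index entry carries the row pointer with it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical reconstruction of the test table: two values, M and F.
cur.execute("CREATE TABLE mf (id INTEGER PRIMARY KEY, gender CHAR(1))")
cur.executemany("INSERT INTO mf (gender) VALUES (?)",
                [("M",), ("F",)] * 50)
cur.execute("CREATE INDEX idx_gender ON mf (gender)")

# Equality lookup requesting only id: the plan searches idx_gender,
# and since each index entry carries the rowid, no table access is needed.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM mf WHERE gender = 'M'"
).fetchall()
detail = plan[-1][-1]
print(detail)  # e.g. "SEARCH mf USING COVERING INDEX idx_gender (gender=?)"
```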
Analysis for InnoDB

When the data was loaded as InnoDB, please note that all four `EXPLAIN` plans used the `gender` index. The third and fourth `EXPLAIN` plans used the `gender` index even though the requested data was `id`. Why? Because `id` is in the `PRIMARY KEY`, and all secondary indexes have reference pointers back to the `PRIMARY KEY` (via the gen_clust_index).

Analysis for MyISAM
When the data was loaded as MyISAM, please note that the first three `EXPLAIN` plans used the `gender` index. In the fourth `EXPLAIN` plan, the Query Optimizer decided not to use an index at all; it opted for a full table scan instead. Why?

Regardless of the DBMS, Query Optimizers operate on a very simple rule of thumb: if an index is being screened as a candidate for performing the lookup, and the Query Optimizer computes that it must look up more than 5% of the total number of rows in the table, it will opt for a full table scan.
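That rule of thumb is easy to state in code (a sketch only; the 5% figure is the heuristic quoted above, not a hard constant in any engine):

```python
def optimizer_choice(total_rows, matching_rows, threshold=0.05):
    """Toy model of the rule of thumb: if the candidate index would
    have to look up more than ~5% of the table, do a full scan."""
    selectivity = matching_rows / total_rows
    return "full table scan" if selectivity > threshold else "index lookup"

# A gender column with two evenly distributed values: ~50% per key.
print(optimizer_choice(total_rows=1_000_000, matching_rows=500_000))
# -> full table scan

# A high-cardinality key matching only a handful of rows:
print(optimizer_choice(total_rows=1_000_000, matching_rows=20))
# -> index lookup
```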
CONCLUSION
If you do not have proper covering indexes, or if the key population for any given tuple is more than 5% of the table, six things must happen:

1. Gather the queries run against the table
2. Identify the `WHERE`, `GROUP BY`, and `ORDER BY` clauses from those queries
3. Build indexes starting with the `WHERE` clause columns that have static values
4. Follow those with the `GROUP BY` columns
5. Follow those with the `ORDER BY` columns
6. Where possible, make the index cover the selected columns as well (not just the columns in the `WHERE` clause)

I have written about this 5% rule of thumb in the past:
- May 07, 2012 : MySQL EXPLAIN doesn't show 'use index' for FULLTEXT
- Mar 22, 2012 : Why does MySQL choose this execution plan?
- Mar 09, 2012 : index not being used
- Jan 18, 2012 : MySQL status variable Handler_read_rnd_next is growing a lot
- Dec 27, 2011 : MySQL - fastest way to ALTER TABLE for InnoDB
- Jul 29, 2011 : MySQL Query Optimization : Indexing and Pagination
- Jul 12, 2011 : MySQL very slow query when changing one WHERE field despite no index/key

UPDATE 2012-11-14 13:05 EDT
I took a look back at your question and at the original SO post. Then, I thought about my Analysis for InnoDB mentioned before. It coincides with the `person` table. Why? For both tables `mf` and `person`, a query requesting only `id` can still show a secondary index in its `EXPLAIN` plan, because every secondary index entry carries the `PRIMARY KEY` with it.

Now, look at the query from the SO question:
`select * from person order by age\G`

Since there is no `WHERE` clause, you explicitly demanded a full table scan. The default sort order of the table would be by `id` (the PRIMARY KEY), because of its auto_increment, and the gen_clust_index (aka Clustered Index) is ordered by the internal rowid. When you order by the index, keep in mind that InnoDB secondary indexes have the rowid attached to each index entry. This produces the internal need for a full row access each time.

Setting up an `ORDER BY` on an InnoDB table can be a rather daunting task if you ignore these facts about how InnoDB indexes are organized.

Going back to that SO query: since you explicitly demanded a full table scan, IMHO the MySQL Query Optimizer did the correct thing (or at least chose the path of least resistance). When it comes to InnoDB and the SO query, it is far easier to perform a full table scan and then some filesort, rather than doing a full index scan and a row lookup via the gen_clust_index for each secondary index entry.

I am not an advocate of using Index Hints, because it ignores the EXPLAIN plan. Notwithstanding, if you really know your data better than InnoDB does, you will have to resort to Index Hints, especially with queries that have no `WHERE` clause.

UPDATE 2012-11-14 14:21 EDT
According to the book Understanding MySQL Internals, Page 202, Paragraph 7 says the following:
This is why I stated earlier: it is far easier to perform a full table scan and then some filesort, rather than doing a full index scan and a row lookup via the gen_clust_index for each secondary index entry. InnoDB is going to do a double index lookup every time. That sounds kind of brutal, but those are just the facts. Again, take into consideration the lack of a `WHERE` clause. This, in itself, is the hint to the MySQL Query Optimizer to do a full table scan.
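To see why that double lookup hurts, here is a toy operation count (pure Python; the structures are illustrative, not InnoDB internals): retrieving all rows ordered by a secondary key either walks the secondary index and pays one clustered-index point lookup per entry, or scans the table once and sorts the result, and both paths return the same rows.

```python
# Toy model of the access paths for: SELECT * FROM person ORDER BY age
clustered = {pk: {"id": pk, "age": 101 - pk} for pk in range(1, 101)}  # pk -> row
secondary = sorted((row["age"], pk) for pk, row in clustered.items())  # (age, pk)

# Path 1: full index scan + one clustered-index lookup per secondary entry.
lookups = 0
via_index = []
for age, pk in secondary:
    lookups += 1                     # each entry carries only the PK,
    via_index.append(clustered[pk])  # so every row needs a second search

# Path 2: full table scan, then filesort.
via_scan = sorted(clustered.values(), key=lambda r: r["age"])

assert via_index == via_scan  # same result either way
print(lookups)  # the index path paid 100 extra point lookups
```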