Things are more complicated than that. Here are a few points of consideration.
First, this entire discussion assumes B-Trees or B+ Trees (hence the O(log n)). There are other types of indexes, such as hash indexes, where access is O(1). Your question suggests you're looking up values with an equality search (e.g. looking for X = 17). In that particular scenario, a hash index is preferable, when possible.
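To make the distinction concrete, here is a minimal Python sketch (illustrative only, nothing like real index code): a dict plays the role of a hash index with O(1) equality lookups, while a sorted list searched with `bisect` stands in for an ordered, B-Tree-like index with O(log n) lookups. Note that only the ordered structure can also answer range queries, which is a big part of why B-Trees dominate in practice.

```python
import bisect

# "Hash index": equality lookups in O(1) on average.
hash_index = {17: "row_a", 42: "row_b", 99: "row_c"}
print(hash_index[17])  # X = 17 -> "row_a"

# "B-Tree-like index": keys kept sorted, searched in O(log n).
keys = [17, 42, 99]
rows = ["row_a", "row_b", "row_c"]
i = bisect.bisect_left(keys, 17)
print(rows[i])  # also "row_a", but found via binary search

# Only the ordered structure supports range scans (e.g. X BETWEEN 20 AND 100):
lo = bisect.bisect_left(keys, 20)
hi = bisect.bisect_right(keys, 100)
print(rows[lo:hi])  # ["row_b", "row_c"]
```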
I do agree though that most indexes you'll find today are B/B+ Trees, so let's continue with this assumption.
You've also implicitly indicated that there's always exactly one resulting row in a SELECT, which is hardly a representative case; plenty of times we look for 1,000 rows at a time. But let's continue with the assumption of a single matching row.
Your next assumption is that searches are always done by the indexed column. This is fine, but I'm just noting that DELETEing a record by some unindexed column Y turns out to be more expensive: you both waste O(n) time finding the record, and then pay an additional O(log n) for updating the index.
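As a toy illustration of that double cost (pure Python, not real storage-engine code): the delete on the unindexed column has to scan every row, and the index on the other column still has to be maintained afterward.

```python
import bisect

# Table rows: (id, y). The index is on id only; y is unindexed.
table = [(1, "apple"), (2, "banana"), (3, "cherry")]
id_index = [1, 2, 3]  # sorted list standing in for a B-Tree on id

def delete_by_y(y_value):
    """DELETE ... WHERE y = y_value, where y is the unindexed column."""
    # O(n): full scan to find the matching row.
    for pos, (row_id, y) in enumerate(table):
        if y == y_value:
            del table[pos]
            # O(log n): the id index must be updated as well.
            id_index.pop(bisect.bisect_left(id_index, row_id))
            return row_id
    return None

delete_by_y("banana")
print(table)     # [(1, "apple"), (3, "cherry")]
print(id_index)  # [1, 3]
```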
But let us continue with the assumption that we're only discussing queries that look up indexed columns.
Some tables use the unclustered index format (which fits your calculations): the table is one entity and the index is another. Others use the clustered index format: table rows (or rather blocks of rows, or rather yet pages of rows) are actually stored as leaves inside the clustering index. In that scenario, you pay O(log n) to find a record's position for an INSERT command. An optimization exists for the case where you're inserting at the end of the table: a decent implementation holds a pointer to the last record/page in the index. (And yes, you should note the possibility that your record gets INSERTed into the middle of the table.)
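A sketch of that fast path in Python (hypothetical, with a sorted list standing in for the clustered index): if the new key is at least as large as the current last key, we append in O(1) and never descend the tree; otherwise the record lands in the middle and we pay the O(log n) search.

```python
import bisect

clustered = []  # keys in clustered-index order

def insert(key):
    # Fast path: inserting at the end of the table needs no search,
    # assuming we keep a handle on the last record/page.
    if not clustered or key >= clustered[-1]:
        clustered.append(key)  # O(1), no O(log n) descent
    else:
        # Record lands in the middle of the table: O(log n) search
        # (plus, in a real B-Tree, a possible page split).
        bisect.insort(clustered, key)

for k in [10, 20, 30, 25, 40]:
    insert(k)
print(clustered)  # [10, 20, 25, 30, 40]
```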
Actually, records can get inserted into the middle of the table even in unclustered tables; it makes sense to spend more search time in order to avoid fragmentation, and at least one implementation I know of does exactly that. I'm assuming others may, too.
Deleting/inserting an index entry in a B-Tree is O(1) on average, but may cost up to O(log n) operations in the case of propagated page merges/splits.
Also, a fact that always comes as a surprise to many: sometimes a table scan is faster than an index lookup. This is particularly true for queries returning multiple rows. Looking up the index adds overhead; compared to a full table scan, that overhead can actually make the total cost higher. For a single-row lookup, the vast majority of index lookups should be faster than a full table scan.
But do consider the following general convention: you pay in dollars for any action that accesses disk, and in nickels for actions on in-memory data. This is at the heart of database disk I/O optimization. If the index pages are on disk while the table pages happen to be in memory, you may well pay less for the table scan.
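Some back-of-the-envelope arithmetic under that convention (all numbers made up for illustration): charge a "dollar" per page read from disk and a "nickel" per page already in memory, then compare a full table scan of a cached table against an index lookup whose pages are all cold.

```python
DISK_PAGE = 1.00   # "dollars": page read from disk
MEM_PAGE  = 0.05   # "nickels": page already in memory

# Hypothetical table: 10,000 pages, all cached in memory.
table_pages = 10_000
scan_cost = table_pages * MEM_PAGE

# Hypothetical index lookup for 2,000 matching rows: each match costs
# an index descent (3 pages) plus one row fetch, all on cold disk pages.
matching_rows = 2_000
index_cost = matching_rows * (3 + 1) * DISK_PAGE

print(f"table scan:   ${scan_cost:,.2f}")   # $500.00
print(f"index lookup: ${index_cost:,.2f}")  # $8,000.00
# With the table in memory and the index on disk, the "slower" O(n)
# scan comes out an order of magnitude cheaper than the O(log n) lookups.
```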
And that's where havoc comes in: it really all depends on your workloads, on your memory size, on your dataset size.
Did you ever take a math class where you had to solve some complex integral? It took you hours to solve, and you got points off for missing some tiny minus sign somewhere?
Did you ever take a physics class where you had to solve some complex integral? The professor would just throw away chunks of the equation, saying "this is negligible", and you would tear your hair out? WHY is it negligible? Why not other things?
Computer science is based on math. Computers are based on physics. They are physical objects. They need to spin disks, access a memory bus, manage billions of transistors... You just can't anticipate everything that will happen and put it all under one equation.
It may just turn out that for some particular dataset your entire equation does not hold water. In other times, it may be just fine.
I have some rather distressing news: ORDER BY can still wreak havoc with filesorts. Despite all the hype about this being addressed and fixed, there is simply no way to get InnoDB to effectively use an index for every ORDER BY.
Start with the Ground Zero of InnoDB row data, the Clustered Index.
Rows are tagged with:
- a 6-byte transaction ID field
- a 7-byte roll pointer field
Rows tend to be stored in whatever order the data was entered. The columns of the PRIMARY KEY are included in secondary indexes and are used to locate rows in the Clustered Index. Unfortunately, the two ID fields are not really used to dictate any ordering of rows within the Clustered Index. (For more info, please see the MySQL Documentation on InnoDB Physical Row Structure.)
Here is something even more disturbing: did you know you can order the rows in a table by the columns of the PRIMARY KEY, or by any arbitrary ordering you choose?
Here is the syntax:
ALTER TABLE tblname ORDER BY col_name [, col_name] ...
This could speed up some queries that rely on PRIMARY KEY ordering, but what's disturbing is that it only applies to MyISAM tables. Why not InnoDB?
According to the MySQL Documentation on `ALTER TABLE ... ORDER BY`:
ORDER BY enables you to create the new table with the rows in a
specific order. Note that the table does not remain in this order
after inserts and deletes. This option is useful primarily when you
know that you are mostly to query the rows in a certain order most of
the time. By using this option after major changes to the table, you
might be able to get higher performance. In some cases, it might make
sorting easier for MySQL if the table is in order by the column that
you want to order it by later.
ORDER BY syntax permits one or more column names to be specified for
sorting, each of which optionally can be followed by ASC or DESC to
indicate ascending or descending sort order, respectively. The default
is ascending order. Only column names are permitted as sort criteria;
arbitrary expressions are not permitted. This clause should be given
last after any other clauses.
ORDER BY does not make sense for InnoDB tables because InnoDB always
orders table rows according to the clustered index.
This comes as no surprise to me, since I mentioned this in one of my earlier posts (Aug 29, 2011 : Preordering the table by a specified column).
Therefore, doing an `ORDER BY` on an InnoDB table never guarantees proper index selection, due to its internal index organization. Thus, one should not be surprised by a filesort on an InnoDB table, no matter what secondary indexes the table has.
Best Answer
This is a loaded question because there are factors that determine whether an index is worth using.
FACTOR #1
For any given index, what is the key population? In other words, what is the cardinality (distinct count) of all tuples recorded in the index?
FACTOR #2
What storage engine are you using? Are all needed columns accessible from an index?
WHAT'S NEXT ???
Let's take a simple example: a table that holds two values (Male and Female)
Let's create such a table and test it for index usage.
TEST InnoDB
TEST MyISAM
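The original MySQL test scripts are not reproduced above. As a rough stand-in, here is a SQLite sketch of the same idea (the table name `mf` and its layout are my guesses, and SQLite's optimizer is not MySQL's): a secondary index on a two-value gender column is used for an equality lookup, and a query asking only for `id` can be satisfied from that same index, because every index entry carries the row pointer with it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical reconstruction of the test table: two values, M and F.
cur.execute("CREATE TABLE mf (id INTEGER PRIMARY KEY, gender CHAR(1))")
cur.executemany("INSERT INTO mf (gender) VALUES (?)",
                [("M",), ("F",)] * 50)
cur.execute("CREATE INDEX idx_gender ON mf (gender)")

# Equality lookup requesting only id: the plan searches idx_gender,
# and since each index entry carries the rowid, no table access is needed.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM mf WHERE gender = 'M'"
).fetchall()
detail = plan[-1][-1]
print(detail)  # e.g. "SEARCH mf USING COVERING INDEX idx_gender (gender=?)"
```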
Analysis for InnoDB

When the data was loaded as InnoDB, please note that all four `EXPLAIN` plans used the `gender` index. The third and fourth `EXPLAIN` plans used the `gender` index even though the requested data was `id`. Why? Because `id` is in the `PRIMARY KEY`, and all secondary indexes have reference pointers back to the `PRIMARY KEY` (via the gen_clust_index).

Analysis for MyISAM
When the data was loaded as MyISAM, please note that the first three `EXPLAIN` plans used the `gender` index. In the fourth `EXPLAIN` plan, the Query Optimizer decided not to use an index at all; it opted for a full table scan instead. Why?

Regardless of the DBMS, Query Optimizers operate on a very simple rule of thumb: if an index is being screened as a candidate for performing the lookup, and the Query Optimizer computes that it must look up more than 5% of the total number of rows in the table, it will opt for a full table scan.
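That rule of thumb is easy to state in code (a sketch only; the 5% figure is the heuristic quoted above, not a hard constant in any engine):

```python
def optimizer_choice(total_rows, matching_rows, threshold=0.05):
    """Toy model of the rule of thumb: if the candidate index would
    have to look up more than ~5% of the table, do a full scan."""
    selectivity = matching_rows / total_rows
    return "full table scan" if selectivity > threshold else "index lookup"

# A gender column with two evenly distributed values: ~50% per key.
print(optimizer_choice(total_rows=1_000_000, matching_rows=500_000))
# -> full table scan

# A high-cardinality key matching only a handful of rows:
print(optimizer_choice(total_rows=1_000_000, matching_rows=20))
# -> index lookup
```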
CONCLUSION
If you do not have proper covering indexes, or if the key population for any given tuple is more than 5% of the table, six things must happen:

1. Gather the queries run against the table
2. Identify the `WHERE`, `GROUP BY`, and `ORDER BY` clauses from those queries
3. Build indexes starting with the `WHERE` clause columns that have static values
4. Follow those with the `GROUP BY` columns
5. Follow those with the `ORDER BY` columns
6. Where possible, make the index cover the selected columns as well (not just the columns in the `WHERE` clause)

I have written about this 5% rule of thumb in the past:
- May 07, 2012 : MySQL EXPLAIN doesn't show 'use index' for FULLTEXT
- Mar 22, 2012 : Why does MySQL choose this execution plan?
- Mar 09, 2012 : index not being used
- Jan 18, 2012 : MySQL status variable Handler_read_rnd_next is growing a lot
- Dec 27, 2011 : MySQL - fastest way to ALTER TABLE for InnoDB
- Jul 29, 2011 : MySQL Query Optimization : Indexing and Pagination
- Jul 12, 2011 : MySQL very slow query when changing one WHERE field despite no index/key

UPDATE 2012-11-14 13:05 EDT
I took a look back at your question and at the original SO post. Then, I thought about my Analysis for InnoDB mentioned before. It coincides with the `person` table. Why? For both tables `mf` and `person`, a query requesting only `id` can still show a secondary index in its `EXPLAIN` plan, because every secondary index entry carries the `PRIMARY KEY` with it.

Now, look at the query from the SO question:
`select * from person order by age\G`

Since there is no `WHERE` clause, you explicitly demanded a full table scan. The default sort order of the table would be by `id` (the PRIMARY KEY), because of its auto_increment, and the gen_clust_index (aka Clustered Index) is ordered by the internal rowid. When you order by the index, keep in mind that InnoDB secondary indexes have the rowid attached to each index entry. This produces the internal need for a full row access each time.

Setting up an `ORDER BY` on an InnoDB table can be a rather daunting task if you ignore these facts about how InnoDB indexes are organized.

Going back to that SO query: since you explicitly demanded a full table scan, IMHO the MySQL Query Optimizer did the correct thing (or at least chose the path of least resistance). When it comes to InnoDB and the SO query, it is far easier to perform a full table scan and then some filesort, rather than doing a full index scan and a row lookup via the gen_clust_index for each secondary index entry.

I am not an advocate of using Index Hints, because it ignores the EXPLAIN plan. Notwithstanding, if you really know your data better than InnoDB does, you will have to resort to Index Hints, especially with queries that have no `WHERE` clause.

UPDATE 2012-11-14 14:21 EDT
According to the book Understanding MySQL Internals, Page 202, Paragraph 7 says the following:
This is why I stated earlier: it is far easier to perform a full table scan and then some filesort, rather than doing a full index scan and a row lookup via the gen_clust_index for each secondary index entry. InnoDB is going to do a double index lookup every time. That sounds kind of brutal, but those are just the facts. Again, take into consideration the lack of a `WHERE` clause. This, in itself, is the hint to the MySQL Query Optimizer to do a full table scan.
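To see why that double lookup hurts, here is a toy operation count (pure Python; the structures are illustrative, not InnoDB internals): retrieving all rows ordered by a secondary key either walks the secondary index and pays one clustered-index point lookup per entry, or scans the table once and sorts the result, and both paths return the same rows.

```python
# Toy model of the access paths for: SELECT * FROM person ORDER BY age
clustered = {pk: {"id": pk, "age": 101 - pk} for pk in range(1, 101)}  # pk -> row
secondary = sorted((row["age"], pk) for pk, row in clustered.items())  # (age, pk)

# Path 1: full index scan + one clustered-index lookup per secondary entry.
lookups = 0
via_index = []
for age, pk in secondary:
    lookups += 1                     # each entry carries only the PK,
    via_index.append(clustered[pk])  # so every row needs a second search

# Path 2: full table scan, then filesort.
via_scan = sorted(clustered.values(), key=lambda r: r["age"])

assert via_index == via_scan  # same result either way
print(lookups)  # the index path paid 100 extra point lookups
```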