Regardless of the Storage Engine (MyISAM or InnoDB), when it comes to BTREEs, you must make sure you understand the following characteristics:
- Keys should be as small as possible
- Random keys for PRIMARY KEYs
  - Insertions (bulk or programmatic) will periodically force root-node and internal page splits
  - Introduce overhead early in an index's life
  - Breed node fragmentation (especially for index pages)
  - Cause index scans for queries to be performed randomly
- Ordered keys for PRIMARY KEYs
  - Bulk ordered insertions delay root-node and internal splits
  - Reloading data via `mysqldump` files and `LOAD DATA INFILE` commands promotes the use of sorting mechanisms for index initialization/reorganization (see my Oct 26, 2012 post: How badly does innodb fragment in the face of somewhat out-of-order insertions?)
  - Programmatic ordered insertions promote root-node and internal splitting of index pages in 45% of the cases
  - Delay creation of overhead
  - Prevent node fragmentation
  - Cause index scans for queries to be performed sequentially (less disk I/O)
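As a sketch of the contrast above (table and column names are hypothetical), an ordered `AUTO_INCREMENT` key versus a random `UUID()` key in MySQL:

```sql
-- Ordered key: monotonically increasing values always append to the
-- right-most leaf page, delaying splits and keeping the clustered
-- index compact.
CREATE TABLE orders_ordered (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    placed_at DATETIME NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Random key: UUID() values land on arbitrary leaf pages, forcing
-- page splits throughout the tree and fragmenting index pages.
CREATE TABLE orders_random (
    id CHAR(36) NOT NULL,  -- populated with UUID()
    placed_at DATETIME NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;
```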
When it comes to BTREEs in InnoDB, they tend to be more bloated than their MyISAM counterparts because of InnoDB's clustered index (gen_clust_index), where the row data live.
The PRIMARY KEY of an InnoDB table points right to its gen_clust_index. Secondary indexes always include a PRIMARY KEY entry. If you run a query that uses a secondary index and also has non-indexed columns in the WHERE clause, you could easily be doing two index lookups. With that in mind, you should make sure all secondary indexes carry all the columns your queries' WHERE clauses need (a.k.a. a covering index).
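A minimal covering-index sketch, using a hypothetical `orders` table: every column the query touches lives in the secondary index, so no second lookup into the clustered index is needed.

```sql
-- Query to cover (status is otherwise non-indexed):
--   SELECT order_no FROM orders
--   WHERE  customer_id = 42 AND status = 'OPEN';

-- Covering index: every column the query touches is in the index,
-- so InnoDB never has to follow the PRIMARY KEY back to the row.
ALTER TABLE orders
    ADD INDEX idx_cust_status_order (customer_id, status, order_no);
```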
How does LIKE '%123456789%' benefit from indexing?
Only a little bit. The query processor can scan the whole nonclustered index looking for matches instead of the entire table (the clustered index). Nonclustered indexes are generally smaller than the table they are built on, so scanning the nonclustered index may be faster.
The downside is that any columns needed by the query that are not included in the nonclustered index definition must be looked up in the base table, row by row.
The optimizer makes a decision between scanning the table (clustered index) and scanning the nonclustered index with lookups, based on cost estimates. The estimated costs depend to a great extent on how many rows the optimizer expects your `LIKE` or `CHARINDEX` predicate to select.
Why do the listed articles state that it will not improve performance?
For a `LIKE` condition that does not start with a wildcard, SQL Server can perform a partial scan of the index instead of scanning the whole thing. For example, `LIKE 'A%'` can be correctly evaluated by testing only index records `>= 'A'` and `< 'B'` (the exact boundary values depend on collation).
This sort of query can use the seeking ability of b-tree indexes: we can go straight to the first record `>= 'A'` using the b-tree, then scan forward in index key order until we reach a record that fails the `< 'B'` test. Since we only need to apply the `LIKE` test to a smaller number of rows, performance is generally better.
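As a sketch of the range the engine effectively seeks (assuming a hypothetical index on `[Customer name]` of the `dbo.customer` table used later in this answer):

```sql
-- LIKE 'A%' can be answered by an index seek to the start of the range
-- followed by an ordered scan; conceptually the predicate becomes:
SELECT [Customer name]
FROM dbo.customer
WHERE [Customer name] >= 'A'
  AND [Customer name] < 'B';  -- exact boundary values depend on collation
```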
By contrast, `LIKE '%A'` cannot be turned into a partial scan because we don't know where to start or end; any record could end in `'A'`, so we cannot improve on scanning the whole index and testing every row individually.
I tried rewriting the query to use `CHARINDEX`, but performance is still slow. Why does `CHARINDEX` not benefit from the indexing as the `LIKE` query appears to?
The query optimizer has the same choice between scanning the table (clustered index) and scanning the nonclustered index (with lookups) in both cases.
The choice is made between the two based on cost estimation. It so happens that SQL Server may produce a different estimate for the two methods. For the `LIKE` form of the query, the estimate may be able to use special string statistics to produce a reasonably accurate figure. The `CHARINDEX > 0` form produces an estimate based on a guess.
The different estimates are enough to make the optimizer choose a Clustered Index Scan for `CHARINDEX` and a Nonclustered Index Scan with lookups for `LIKE`. If you force the `CHARINDEX` query to use the nonclustered index with a hint, you will get the same plan as for `LIKE`, and performance will be about the same:
```sql
SELECT
    [Customer name],
    [Sl_No],
    [Id]
FROM dbo.customer WITH (INDEX (f))
WHERE
    CHARINDEX('9000413237', [Phone no]) > 0;
```
The number of rows processed at runtime will be the same for both methods; it's just that the `LIKE` form produces a more accurate estimate in this case, so the query optimizer chooses a better plan.
If you find yourself needing `LIKE '%thing%'` searches often, you might want to consider a technique I wrote about in Trigram Wildcard String Search in SQL Server.
Best Answer
Indexing strategy tends to evolve as usage patterns emerge. That said, there are also strategies and design guidelines that can be applied up front.
Choose a good clustering key. You can usually determine the appropriate clustered index at design time, based on the expected pattern of inserts to a table. If a compelling case emerges for a change in the future, so be it.
Create your primary and other unique constraints. These will be enforced by unique indexes.
Create your foreign keys and associated non-clustered indexes. Foreign keys are your most frequently referenced join columns, so index them from the start.
Create indexes for any obviously selective queries, i.e. query patterns you already know will be highly selective and likely to use lookups rather than scans.
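The foreign-key guideline above can be sketched in T-SQL (all object and constraint names are hypothetical):

```sql
-- Hypothetical parent/child pair: the foreign-key column is the join
-- column you will filter on most often, so index it from the start.
CREATE TABLE dbo.orders (
    order_id    INT IDENTITY PRIMARY KEY,   -- clustering key
    customer_id INT NOT NULL
        CONSTRAINT FK_orders_customer
        REFERENCES dbo.customer (customer_id)
);

-- Foreign keys do not get an index automatically in SQL Server.
CREATE NONCLUSTERED INDEX IX_orders_customer_id
    ON dbo.orders (customer_id);
```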
Beyond the above, take a gradual and holistic approach to implementing new indexes. By holistic, I mean assess the potential benefit and impact to all queries and existing indexes when evaluating an addition.
A not uncommon problem in SQL Server circles is overindexing, as a result of guidance from the missing index DMVs and SSMS hints. Neither of these tools evaluates existing indexes, and they will merrily suggest you create a new 6-column index rather than add a single column to an existing 5-column index.
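To illustrate the overindexing trap with a made-up example (index, table, and column names are all hypothetical): suppose a five-column index already exists and the missing-index DMV suggests a new six-column one. Widening the existing index is usually the better move:

```sql
-- Existing index:
--   CREATE INDEX IX_sales_wide ON dbo.sales (a, b, c, d, e);
-- DMV suggestion for a query that also reads column f:
--   CREATE INDEX IX_sales_suggested ON dbo.sales (a, b, c, d, e, f);

-- Better: extend the existing index rather than create a near-duplicate.
CREATE INDEX IX_sales_wide
    ON dbo.sales (a, b, c, d, e)
    INCLUDE (f)
    WITH (DROP_EXISTING = ON);
```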
Kimberly Tripp has some excellent material on indexing strategy that, while SQL Server focused, is applicable to other platforms. For the SQL Server folk, there are some handy tools for identifying duplicates like the example above.
This usually only applies for rarely run queries, typically ETL. You need to assess: