How does LIKE '%123456789%' benefit from indexing?
Only a little bit. The query processor can scan the whole nonclustered index looking for matches instead of the entire table (the clustered index). Nonclustered indexes are generally smaller than the table they are built on, so scanning the nonclustered index may be faster.
The downside is that any columns needed by the query that are not included in the nonclustered index definition must be looked up in the base table, once per matching row.
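When those lookups dominate the cost, a covering index avoids them entirely. A sketch using the names from the query later in this answer (the exact index definition is an assumption on my part):

```sql
-- Hypothetical covering index: the INCLUDE columns are stored in the
-- index leaf rows, so no per-row lookup into the base table is needed.
CREATE NONCLUSTERED INDEX f
ON dbo.customer ([Phone no])
INCLUDE ([Customer name], [Sl_No], [Id]);
```

With every referenced column available in the index, the optimizer no longer has to weigh scan-plus-lookups against a full table scan.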
The optimizer makes a decision between scanning the table (clustered index) and scanning the nonclustered index with lookups, based on cost estimates. The estimated costs depend to a great extent on how many rows the optimizer expects your LIKE or CHARINDEX predicate to select.
Why do the listed articles state that it will not improve performance?
For a LIKE condition that does not start with a wildcard, SQL Server can perform a partial scan of the index instead of scanning the whole thing. For example, LIKE 'A%' can be correctly evaluated by testing only index records >= 'A' and < 'B' (the exact boundary values depend on collation).

This sort of query can use the seeking ability of b-tree indexes: we can go straight to the first record >= 'A' using the b-tree, then scan forward in index key order until we reach a record that fails the < 'B' test. Since we only need to apply the LIKE test to a smaller number of rows, performance is generally better.
By contrast, LIKE '%A' cannot be turned into a partial scan because we don't know where to start or end; any record could end with 'A', so we cannot improve on scanning the whole index and testing every row individually.
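The contrast can be sketched with the names from the question (the phone data itself is assumed):

```sql
-- Sargable: the b-tree can seek to the first entry >= '9000'
-- and stop at the first entry >= '9001', scanning only that slice.
SELECT [Id]
FROM dbo.customer
WHERE [Phone no] LIKE '9000%';

-- Not sargable: with a leading wildcard, any entry could match,
-- so every index record must be read and tested.
SELECT [Id]
FROM dbo.customer
WHERE [Phone no] LIKE '%9000%';
```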
I tried rewriting the query to use CHARINDEX, but performance is still slow. Why does CHARINDEX not benefit from the indexing as the LIKE query appears to?
The query optimizer has the same choice between scanning the table (clustered index) and scanning the nonclustered index (with lookups) in both cases.
The choice between the two is made based on cost estimation, and it so happens that SQL Server may produce a different estimate for each method. For the LIKE form of the query, the estimate may be able to use special string statistics and so be reasonably accurate. The CHARINDEX(...) > 0 form produces an estimate based on a guess.
The different estimates are enough to make the optimizer choose a Clustered Index Scan for CHARINDEX and a Nonclustered Index Scan with Lookups for LIKE. If you force the CHARINDEX query to use the nonclustered index with a hint, you will get the same plan as for LIKE, and performance will be about the same:
```sql
SELECT
    [Customer name],
    [Sl_No],
    [Id]
FROM dbo.customer WITH (INDEX (f))
WHERE
    CHARINDEX('9000413237', [Phone no]) > 0;
```
The number of rows processed at runtime will be the same for both methods; it's just that the LIKE form produces a more accurate estimate in this case, so the query optimizer chooses a better plan.
If you find yourself needing LIKE '%thing%' searches often, you might want to consider a technique I wrote about in Trigram Wildcard String Search in SQL Server.
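As a rough illustration of the idea (not the implementation from that article, and with table and column names that are my own assumptions): a trigram index stores every three-character substring of the indexed value, so a wildcard search can first narrow down candidate rows by trigram, then verify only those few candidates with the original LIKE.

```sql
-- Hypothetical trigram table: one row per (trigram, source row).
CREATE TABLE dbo.PhoneTrigrams
(
    trigram char(3) NOT NULL,
    Id      integer NOT NULL,
    PRIMARY KEY (trigram, Id)
);

-- A search for '%9000413237%' first finds rows containing a few
-- selective trigrams of the search string...
SELECT PT.Id
FROM dbo.PhoneTrigrams AS PT
WHERE PT.trigram IN ('900', '004', '041')
GROUP BY PT.Id
HAVING COUNT(*) = 3;
-- ...then only those candidate rows are verified with LIKE.
```

The point is that the trigram lookups are index seeks, so the expensive per-row LIKE test runs against a small candidate set instead of the whole table.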
If you only need to know that a row is distinct, and don't need the actual contents of col3, then perhaps returning a hash of col3 would speed up the query? You could even pre-compute the hash in a computed column so that you aren't computing it on the fly.

If you do need the contents of col3, but have a lot of duplicates of col1+col2+col3, then it may still be beneficial to work with the hash to remove duplicates in a sub-query, then return the col3 contents only for the distinct rows.
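A minimal sketch of the pre-computed hash idea (HASHBYTES is the built-in SQL Server hashing function; the table name dbo.t is an assumption for illustration):

```sql
-- Persisted computed column: the hash is calculated once on write,
-- not on every read. HASHBYTES is deterministic, so PERSISTED is allowed.
ALTER TABLE dbo.t
    ADD col3_hash AS CAST(HASHBYTES('SHA2_256', col3) AS binary(32)) PERSISTED;

-- Index the narrow hash instead of the wide col3 value.
CREATE INDEX IX_t_col3_hash ON dbo.t (col1, col2, col3_hash);

-- Distinct rows can then be identified on the compact key.
SELECT DISTINCT col1, col2, col3_hash
FROM dbo.t;
```

A fixed 32-byte hash keeps the index small and the DISTINCT comparison cheap, at the cost of a little extra storage per row.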
Best Answer
There is no particular problem with using the trigram technique I wrote about in Trigram Wildcard String Search in SQL Server with multiple columns, dynamic SQL, or 200 million records.
The comments on that article show people using the basic technique with multiple columns, much larger tables, and even with encryption. The modifications needed are normally quite straightforward, assuming the person making the changes understands the basic underlying implementation shown in my article.
Note though that trigram search requires at least a three-character substring to search for (as the name suggests). If you really need to perform two-character searches (as in the question example), more modifications would be needed, and you would need to carefully assess the costs and benefits with your data. Without further information, it would seem that a two-character match on hundreds of millions of rows would produce a great number of matches.
If n-gram search is not suitable for you, and you can tolerate some latency, an external solution like Elasticsearch might be worth looking into.
Example
The code below shows one way to extend the basic trigram search to work with three string columns in the same table. For brevity, it does not implement the trigger logic shown in my article to keep the trigrams synchronized with the underlying data. If this is required, the modifications needed follow much the same pattern.
That particular implementation choice allows trigram searches with the column name as a parameter:
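Independent of the full implementation, the column-name-as-parameter idea can be sketched with parameterized dynamic SQL (the procedure and names here are illustrative assumptions, with the trigram candidate lookup itself elided):

```sql
-- Hypothetical wrapper: QUOTENAME protects the column name (which cannot
-- be a parameter in static SQL); the search string is passed as a real
-- parameter. The trigram candidate lookup from my article would replace
-- the plain LIKE shown here.
CREATE PROCEDURE dbo.SearchColumn
    @column sysname,
    @search varchar(100)
AS
BEGIN
    DECLARE @sql nvarchar(max) =
        N'SELECT C.[Id] FROM dbo.customer AS C WHERE '
        + QUOTENAME(@column)
        + N' LIKE ''%'' + @search + ''%'';';

    EXECUTE sys.sp_executesql
        @sql,
        N'@search varchar(100)',
        @search;
END;
```

Only the column name is spliced into the SQL text; the search value stays a parameter, which preserves plan reuse and avoids injection through the search string.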