How does LIKE '%123456789%' benefit from indexing?
Only a little bit. The query processor can scan the whole nonclustered index looking for matches instead of the entire table (the clustered index). Nonclustered indexes are generally smaller than the table they are built on, so scanning the nonclustered index may be faster.
The downside is that any columns needed by the query that are not included in the nonclustered index definition must be looked up in the base table, once per matching row.
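When those lookups dominate the cost, a covering index avoids them entirely. A sketch using the names from the query later in this answer (the exact index definition is an assumption on my part):

```sql
-- Hypothetical covering index: the INCLUDE columns are stored in the
-- index leaf rows, so no per-row lookup into the base table is needed.
CREATE NONCLUSTERED INDEX f
ON dbo.customer ([Phone no])
INCLUDE ([Customer name], [Sl_No], [Id]);
```

With every referenced column available in the index, the optimizer no longer has to weigh scan-plus-lookups against a full table scan.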
The optimizer makes a decision between scanning the table (clustered index) and scanning the nonclustered index with lookups, based on cost estimates. The estimated costs depend to a great extent on how many rows the optimizer expects your LIKE or CHARINDEX predicate to select.
Why do the listed articles state that it will not improve performance?
For a LIKE condition that does not start with a wildcard, SQL Server can perform a partial scan of the index instead of scanning the whole thing. For example, LIKE 'A%' can be correctly evaluated by testing only index records >= 'A' and < 'B' (the exact boundary values depend on collation).

This sort of query can use the seeking ability of b-tree indexes: we can go straight to the first record >= 'A' using the b-tree, then scan forward in index key order until we reach a record that fails the < 'B' test. Since we only need to apply the LIKE test to a smaller number of rows, performance is generally better.
By contrast, LIKE '%A' cannot be turned into a partial scan because we don't know where to start or end; any record could end with 'A', so we cannot improve on scanning the whole index and testing every row individually.
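The contrast can be sketched with the names from the question (the phone data itself is assumed):

```sql
-- Sargable: the b-tree can seek to the first entry >= '9000'
-- and stop at the first entry >= '9001', scanning only that slice.
SELECT [Id]
FROM dbo.customer
WHERE [Phone no] LIKE '9000%';

-- Not sargable: with a leading wildcard, any entry could match,
-- so every index record must be read and tested.
SELECT [Id]
FROM dbo.customer
WHERE [Phone no] LIKE '%9000%';
```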
I tried rewriting the query to use CHARINDEX, but performance is still slow. Why does CHARINDEX not benefit from the indexing as the LIKE query appears to?
The query optimizer has the same choice between scanning the table (clustered index) and scanning the nonclustered index (with lookups) in both cases.
The choice between the two is made based on cost estimation, and it so happens that SQL Server may produce a different estimate for each method. For the LIKE form of the query, the estimate may be able to use special string statistics and so be reasonably accurate. The CHARINDEX(...) > 0 form produces an estimate based on a guess.
The different estimates are enough to make the optimizer choose a Clustered Index Scan for CHARINDEX and a Nonclustered Index Scan with Lookups for LIKE. If you force the CHARINDEX query to use the nonclustered index with a hint, you will get the same plan as for LIKE, and performance will be about the same:
```sql
SELECT
    [Customer name],
    [Sl_No],
    [Id]
FROM dbo.customer WITH (INDEX (f))
WHERE
    CHARINDEX('9000413237', [Phone no]) > 0;
```
The number of rows processed at runtime will be the same for both methods; it's just that the LIKE form produces a more accurate estimate in this case, so the query optimizer chooses a better plan.
If you find yourself needing LIKE '%thing%' searches often, you might want to consider a technique I wrote about in Trigram Wildcard String Search in SQL Server.
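As a rough illustration of the idea (not the implementation from that article, and with table and column names that are my own assumptions): a trigram index stores every three-character substring of the indexed value, so a wildcard search can first narrow down candidate rows by trigram, then verify only those few candidates with the original LIKE.

```sql
-- Hypothetical trigram table: one row per (trigram, source row).
CREATE TABLE dbo.PhoneTrigrams
(
    trigram char(3) NOT NULL,
    Id      integer NOT NULL,
    PRIMARY KEY (trigram, Id)
);

-- A search for '%9000413237%' first finds rows containing a few
-- selective trigrams of the search string...
SELECT PT.Id
FROM dbo.PhoneTrigrams AS PT
WHERE PT.trigram IN ('900', '004', '041')
GROUP BY PT.Id
HAVING COUNT(*) = 3;
-- ...then only those candidate rows are verified with LIKE.
```

The point is that the trigram lookups are index seeks, so the expensive per-row LIKE test runs against a small candidate set instead of the whole table.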
If you only need to know that a row is distinct, and don't need the actual contents of col3, then perhaps returning a hash of col3 would speed up the query? You could even pre-compute the hash in a computed column so that you aren't computing it on the fly.

If you do need the contents of col3, but have a lot of duplicates of col1+col2+col3, then it may still be beneficial to work with the hash to remove duplicates in a sub-query, then return the col3 contents only for the distinct rows.
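A minimal sketch of the pre-computed hash idea (HASHBYTES is the built-in SQL Server hashing function; the table name dbo.t is an assumption for illustration):

```sql
-- Persisted computed column: the hash is calculated once on write,
-- not on every read. HASHBYTES is deterministic, so PERSISTED is allowed.
ALTER TABLE dbo.t
    ADD col3_hash AS CAST(HASHBYTES('SHA2_256', col3) AS binary(32)) PERSISTED;

-- Index the narrow hash instead of the wide col3 value.
CREATE INDEX IX_t_col3_hash ON dbo.t (col1, col2, col3_hash);

-- Distinct rows can then be identified on the compact key.
SELECT DISTINCT col1, col2, col3_hash
FROM dbo.t;
```

A fixed 32-byte hash keeps the index small and the DISTINCT comparison cheap, at the cost of a little extra storage per row.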
Best Answer
There is no particular problem with using the trigram technique I wrote about in Trigram Wildcard String Search in SQL Server with multiple columns, dynamic SQL, or 200 million records.
The comments on that article show people using the basic technique with multiple columns, much larger tables, and even with encryption. The modifications needed are normally quite straightforward, assuming the person making the changes understands the basic underlying implementation shown in my article.
Note though that trigram search requires at least a three-character substring to search for (as the name suggests). If you really need to perform two-character searches (as in the question example), more modifications would be needed, and you would need to carefully assess the costs and benefits with your data. Without further information, it would seem that a two-character match on hundreds of millions of rows would produce a great number of matches.
If n-gram search is not suitable for you, and you can tolerate some latency, an external solution like Elasticsearch might be worth looking into.
Example
The code below shows one way to extend the basic trigram search to work with three string columns in the same table. For brevity, it does not implement the trigger logic shown in my article to keep the trigrams synchronized with the underlying data. If this is required, the modifications needed follow much the same pattern.
That particular implementation choice allows trigram searches with the column name as a parameter:
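Independent of the full implementation, the column-name-as-parameter idea can be sketched with parameterized dynamic SQL (the procedure and names here are illustrative assumptions, with the trigram candidate lookup itself elided):

```sql
-- Hypothetical wrapper: QUOTENAME protects the column name (which cannot
-- be a parameter in static SQL); the search string is passed as a real
-- parameter. The trigram candidate lookup from my article would replace
-- the plain LIKE shown here.
CREATE PROCEDURE dbo.SearchColumn
    @column sysname,
    @search varchar(100)
AS
BEGIN
    DECLARE @sql nvarchar(max) =
        N'SELECT C.[Id] FROM dbo.customer AS C WHERE '
        + QUOTENAME(@column)
        + N' LIKE ''%'' + @search + ''%'';';

    EXECUTE sys.sp_executesql
        @sql,
        N'@search varchar(100)',
        @search;
END;
```

Only the column name is spliced into the SQL text; the search value stays a parameter, which preserves plan reuse and avoids injection through the search string.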