Sql-server – Counting occurrence of words in table is slow

likeperformancequery-performancesql servert-sql

Consider these simplified tables:

CREATE TABLE dbo.words
(
    id bigint NOT NULL IDENTITY (1, 1),
    word varchar(32) NOT NULL,
    hits int NULL
)

CREATE TABLE dbo.items
(
    id bigint NOT NULL IDENTITY (1, 1),
    body varchar(256) NOT NULL,
)

The words table holds about 9000 records, each holding a single word ('phone','sofa','house','dog', …)
The items table holds around 12000 records, each with a body text of no more then 256 characters.

Now, I need to update the words table, counting how many records there are in the items table that hold (at least once) the text in the word field. I need to account for partial words, so all these 4 records should be counted for the word dog:

'This is my dog'  
'I really like the movie dogma'  
'my cousin has sheepdogs'  
'dog dog dog doggerdy dog dog'

That last example should count as just one record (contains the term 'dog' at least once).

I can use this query:

UPDATE dbo.words
SET hits = (SELECT COUNT(*) FROM dbo.items WHERE body like '%' + word + '%')

But, this is extremely slow, this will take over 10 minutes to complete on the not to heavy server I have for it.

AFAIK indexes won't help, I'm doing LIKE searches.
I also think full-text won't help me, since I'm looking for words starting, ending or containing my search term. I could be wrong here.

Any advise on how to speed this up?

Best Answer

The best way I have found to speed up leading wildcard LIKE searches is to use n-grams. I describe the technique and provide a sample implementation in Trigram Wildcard String Search in SQL Server.

The basic idea of a trigram search is quite simple:

  1. Persist three-character substrings (trigrams) of the target data.
  2. Split the search term(s) into trigrams.
  3. Match search trigrams against the stored trigrams (equality search).
  4. Intersect the qualified rows to find strings that match all trigrams.
  5. Apply the original search filter to the much-reduced intersection.

It may be suitable for your needs, but be aware:

Trigram search is no panacea. The extra storage requirements, implementation complexity, and impact on update performance all weigh heavily against it.

Test

I ran a quick test using the Complete Works of Shakespeare to populate the body column of the items table with 15,838 rows. I loaded the words table with 7,669 unique words from the same text.

The trigram structures built in around 2 seconds and the following update statement completed in 5 seconds on my mid-range laptop:

UPDATE dbo.words WITH (TABLOCK)
SET hits = 
(
    SELECT COUNT_BIG(*) 
    FROM dbo.Items_TrigramSearch
        ('%' + word +'%') AS ITS
);

A selection of the updated words table:

sample

The modified trigram scripts from my article are below:

CREATE FUNCTION dbo.GenerateTrigrams (@string varchar(255))
RETURNS table
WITH SCHEMABINDING
AS RETURN
    WITH
        N16 AS 
        (
            SELECT V.v 
            FROM 
            (
                VALUES 
                    (0),(0),(0),(0),(0),(0),(0),(0),
                    (0),(0),(0),(0),(0),(0),(0),(0)
            ) AS V (v)),
        -- Numbers table (256)
        Nums AS 
        (
            SELECT n = ROW_NUMBER() OVER (ORDER BY A.v)
            FROM N16 AS A 
            CROSS JOIN N16 AS B
        ),
        Trigrams AS
        (
            -- Every 3-character substring
            SELECT TOP (CASE WHEN LEN(@string) > 2 THEN LEN(@string) - 2 ELSE 0 END)
                trigram = SUBSTRING(@string, N.n, 3)
            FROM Nums AS N
            ORDER BY N.n
        )
    -- Remove duplicates and ensure all three characters are alphanumeric
    SELECT DISTINCT 
        T.trigram
    FROM Trigrams AS T
    WHERE
        -- Binary collation comparison so ranges work as expected
        T.trigram COLLATE Latin1_General_BIN2 NOT LIKE '%[^A-Z0-9a-z]%';
GO
-- Trigrams for items table
CREATE TABLE dbo.ItemsTrigrams
(
    id integer NOT NULL,
    trigram char(3) NOT NULL
);
GO
-- Generate trigrams
INSERT dbo.ItemsTrigrams WITH (TABLOCKX)
    (id, trigram)
SELECT
    E.id,
    GT.trigram
FROM dbo.items AS E
CROSS APPLY dbo.GenerateTrigrams(E.body) AS GT;
GO
-- Trigram search index
CREATE UNIQUE CLUSTERED INDEX
    [CUQ dbo.ItemsTrigrams (trigram, id)]
ON dbo.ItemsTrigrams (trigram, id)
WITH (DATA_COMPRESSION = ROW);
GO
-- Selectivity of each trigram (performance optimization)
CREATE OR ALTER VIEW dbo.ItemsTrigramCounts
WITH SCHEMABINDING
AS
SELECT ET.trigram, cnt = COUNT_BIG(*)
FROM dbo.ItemsTrigrams AS ET
GROUP BY ET.trigram;
GO
-- Materialize the view
CREATE UNIQUE CLUSTERED INDEX
    [CUQ dbo.ItemsTrigramCounts (trigram)]
ON dbo.ItemsTrigramCounts (trigram);
GO
-- Most selective trigrams for a search string
-- Always returns a row (NULLs if no trigrams found)
CREATE FUNCTION dbo.Items_GetBestTrigrams (@string varchar(255))
RETURNS table
WITH SCHEMABINDING AS
RETURN
    SELECT
        -- Pivot
        trigram1 = MAX(CASE WHEN BT.rn = 1 THEN BT.trigram END),
        trigram2 = MAX(CASE WHEN BT.rn = 2 THEN BT.trigram END),
        trigram3 = MAX(CASE WHEN BT.rn = 3 THEN BT.trigram END)
    FROM 
    (
        -- Generate trigrams for the search string
        -- and choose the most selective three
        SELECT TOP (3)
            rn = ROW_NUMBER() OVER (
                ORDER BY ETC.cnt ASC),
            GT.trigram
        FROM dbo.GenerateTrigrams(@string) AS GT
        JOIN dbo.ItemsTrigramCounts AS ETC
            WITH (NOEXPAND)
            ON ETC.trigram = GT.trigram
        ORDER BY
            ETC.cnt ASC
    ) AS BT;
GO
-- Returns Example ids matching all provided (non-null) trigrams
CREATE FUNCTION dbo.Items_GetTrigramMatchIDs
(
    @Trigram1 char(3),
    @Trigram2 char(3),
    @Trigram3 char(3)
)
RETURNS @IDs table (id integer PRIMARY KEY)
WITH SCHEMABINDING AS
BEGIN
    IF  @Trigram1 IS NOT NULL
    BEGIN
        IF @Trigram2 IS NOT NULL
        BEGIN
            IF @Trigram3 IS NOT NULL
            BEGIN
                -- 3 trigrams available
                INSERT @IDs (id)
                SELECT ET1.id
                FROM dbo.ItemsTrigrams AS ET1 
                WHERE ET1.trigram = @Trigram1
                INTERSECT
                SELECT ET2.id
                FROM dbo.ItemsTrigrams AS ET2
                WHERE ET2.trigram = @Trigram2
                INTERSECT
                SELECT ET3.id
                FROM dbo.ItemsTrigrams AS ET3
                WHERE ET3.trigram = @Trigram3
                OPTION (MERGE JOIN);
            END;
            ELSE
            BEGIN
                -- 2 trigrams available
                INSERT @IDs (id)
                SELECT ET1.id
                FROM dbo.ItemsTrigrams AS ET1 
                WHERE ET1.trigram = @Trigram1
                INTERSECT
                SELECT ET2.id
                FROM dbo.ItemsTrigrams AS ET2
                WHERE ET2.trigram = @Trigram2
                OPTION (MERGE JOIN);
            END;
        END;
        ELSE
        BEGIN
            -- 1 trigram available
            INSERT @IDs (id)
            SELECT ET1.id
            FROM dbo.ItemsTrigrams AS ET1 
            WHERE ET1.trigram = @Trigram1;
        END;
    END;

    RETURN;
END;
GO
-- Search implementation
CREATE FUNCTION dbo.Items_TrigramSearch
(
    @Search varchar(255)
)
RETURNS table
WITH SCHEMABINDING
AS
RETURN
    SELECT
        Result.body
    FROM dbo.Items_GetBestTrigrams(@Search) AS GBT
    CROSS APPLY
    (
        -- Trigram search
        SELECT
            E.id,
            E.body
        FROM dbo.Items_GetTrigramMatchIDs
            (GBT.trigram1, GBT.trigram2, GBT.trigram3) AS MID
        JOIN dbo.Items AS E
            ON E.id = MID.id
        WHERE
            -- At least one trigram found 
            GBT.trigram1 IS NOT NULL
            AND E.body LIKE @Search

        UNION ALL

        -- Non-trigram search
        SELECT
            E.id,
            E.body
        FROM dbo.Items AS E
        WHERE
            -- No trigram found 
            GBT.trigram1 IS NULL
            AND E.body LIKE @Search
    ) AS Result;

The only other change was to add a clustered index to the items table:

CREATE UNIQUE CLUSTERED INDEX cuq ON dbo.items (id);