Efficient Pattern Matching for Single Word in SQL Server

pattern matchingsql serversql-server-2016string manipulation

I'm creating a function to find out what search pattern would best fit a single string. The string to match will be compared against a table of what could be possibly hundreds of possible patterns to see if it matches any of them.

I've got a not-so-elegant function working right now that will read all patterns into a memory table and iterate though them to see if each matches. It works but makes me sick to my stomach to think of using this in a production environment, especially if the pattern table grows to thousands of patterns.

Example string to match: 'Master'

Possible patterns:

  1. m%r
  2. abc%
  3. xyz%
  4. a%bcd

The outcome would be a match to pattern #1.

In the scenario, I need to be able to match potential part numbers, brands, and other possibly misspelled phrases. I'm incorporating Double Metaphone but when it comes to other items that only a pattern will catch, I need to know what pattern was the match among many. The "StringToSearch" would only return 1 matched pattern.

Does anyone have theory or other code types that can help me with this reverse pattern matching process?

Best Answer

Without knowing anything about what your actual scenario looks like, I'd perhaps do something like:

USE tempdb;

IF OBJECT_ID(N'dbo.Strings', N'U') IS NOT NULL
DROP TABLE dbo.Strings;
CREATE TABLE dbo.Strings
(
    TargetString varchar(46) NOT NULL
);

IF OBJECT_ID(N'dbo.Matches', N'U') IS NOT NULL
DROP TABLE dbo.Matches;
CREATE TABLE dbo.Matches
(
    PossibleMatch varchar(46) NOT NULL
);

INSERT INTO dbo.Strings(TargetString)
VALUES ('Master');

INSERT INTO dbo.Matches(PossibleMatch)
VALUES ('m%r')
    , ('abc%')
    , ('zyx%')
    , ('a%bcd')

SELECT *
FROM dbo.Strings s
    INNER JOIN dbo.Matches m ON s.TargetString LIKE m.PossibleMatch
╔══════════════╦═══════════════╗
║ TargetString ║ PossibleMatch ║
╠══════════════╬═══════════════╣
║ Master       ║ m%r           ║
╚══════════════╩═══════════════╝

You'll get a better set of answers if you more accurately describe exactly what your scenario is by explaining how the data is designed, and what you exact design goals are.


To see if this methodology is effective enough, I've extended the MCVE above by adding a couple of indexes and a bunch of data:

The tables now have clustered indexes:

IF OBJECT_ID(N'dbo.Strings', N'U') IS NOT NULL
DROP TABLE dbo.Strings;
CREATE TABLE dbo.Strings
(
    TargetString varchar(46) NOT NULL
        PRIMARY KEY CLUSTERED
);

IF OBJECT_ID(N'dbo.Matches', N'U') IS NOT NULL
DROP TABLE dbo.Matches;
CREATE TABLE dbo.Matches
(
    PossibleMatch varchar(46) NOT NULL
        PRIMARY KEY CLUSTERED
);

This inserts 50,000 words into the Strings table:

;WITH src AS (
    SELECT v.val
    FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9))v(val)
)
, src26 AS 
(
    SELECT TOP(26) Num = (v.val * 10) + v1.val
    FROM src v
        CROSS JOIN src v1
    ORDER BY v.val, v1.val
)
, words AS
(
    SELECT c1 = CHAR(65 + s1.num)
        , c2 = CHAR(65 + s2.num)
        , c3 = CHAR(65 + s3.num)
        , c4 = CHAR(65 + s4.num)
        , c5 = CHAR(65 + s5.num)
    FROM src26 s5
        CROSS JOIN src26 s4
        CROSS JOIN src26 s3
        CROSS JOIN src26 s2
        CROSS JOIN src26 s1
)
INSERT INTO dbo.Strings(TargetString)
SELECT TOP(50000) words.c1
    + words.c2
    + words.c3
    + words.c4
    + words.c5
FROM words
ORDER BY CRYPT_GEN_RANDOM(5);

A sample of the rows from that table:

╔══════════════╗
║ TargetString ║
╠══════════════╣
║ UKMBL        ║
║ ASOCG        ║
║ XWACG        ║
║ NHRXQ        ║
║ LFMBR        ║
║ EQELO        ║
║ PRHCM        ║
║ IIFPD        ║
║ OCLJQ        ║
║ NJBMB        ║
╚══════════════╝

Next, we'll create 1,000,000 rows in the Matches table with wildcards:

;WITH src AS (
    SELECT v.val
    FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9))v(val)
)
, src26 AS 
(
    SELECT TOP(26) Num = (v.val * 10) + v1.val
    FROM src v
        CROSS JOIN src v1
    ORDER BY v.val, v1.val
)
, words AS
(
    SELECT c1 = CHAR(65 + s1.num)
        , c2 = CHAR(65 + s2.num)
        , c3 = CHAR(65 + s3.num)
        , c4 = CHAR(65 + s4.num)
        , c5 = CHAR(65 + s5.num)
    FROM src26 s5
        CROSS JOIN src26 s4
        CROSS JOIN src26 s3
        CROSS JOIN src26 s2
        CROSS JOIN src26 s1
)
INSERT INTO dbo.Matches(PossibleMatch)
SELECT TOP(1000000) REPLACE(words.c1, 'A', '_')
    + REPLACE(words.c2, 'E', '_')
    + REPLACE(words.c3, 'I', '_')
    + REPLACE(words.c4, 'O', '_')
    + REPLACE(words.c5, 'U', '%')
FROM words
WHERE NOT (words.c1 = 'A' AND words.c2 = 'E' AND words.c3 = 'I' AND words.c4 ='O' AND words.c5 <> 'U');

SELECT TotalMatches = (SELECT COUNT(1) FROM dbo.Matches)
    , TotalStrings = (SELECT COUNT(1) FROM dbo.Strings)

Here's a sample of those rows:

╔═══════════════╗
║ PossibleMatch ║
╠═══════════════╣
║ LDBFB         ║
║ IWLMB         ║
║ FNBUA         ║
║ _ZEVC         ║
║ BSZDB         ║
║ TTRIB         ║
║ XVLNA         ║
║ VLPDB         ║
║ UI_IA         ║
║ FNLCB         ║
╚═══════════════╝

As you can see from the sample above, some of the rows have a T-SQL single-character wildcard, _, and some rows don't have wildcards.

Now, we'll run some code to see if we can find a row from Matches that matches up to a row in Strings:

SELECT *
FROM dbo.Strings s
    LEFT JOIN dbo.Matches m ON s.TargetString LIKE m.PossibleMatch
WHERE s.TargetString = (
    SELECT TOP(1) Strings.TargetString
    FROM dbo.Strings 
    ORDER BY CRYPT_GEN_RANDOM(5)
    )

The WHERE clause randomly chooses a word from the Strings table. You may need to run this several times before you get a word with matching values in the Matches table.

Looking up any given word in the Strings table, with associated Matches:

SELECT *
FROM dbo.Strings s
    LEFT JOIN dbo.Matches m ON s.TargetString LIKE m.PossibleMatch
WHERE s.TargetString = 'JRZPC'

The plan for this query:

enter image description here

On my system, this query executes in ~100 milliseconds.

The results of that query:

╔══════════════╦═══════════════╗
║ TargetString ║ PossibleMatch ║
╠══════════════╬═══════════════╣
║ JRZPC        ║ ___PC         ║
║ JRZPC        ║ __Z_C         ║
║ JRZPC        ║ __ZPC         ║
║ JRZPC        ║ _R__C         ║
║ JRZPC        ║ _R_PC         ║
║ JRZPC        ║ _RZ_C         ║
║ JRZPC        ║ _RZPC         ║
╚══════════════╩═══════════════╝

If you were going to run hundreds or thousands of these queries per minute, you might want to re-think the strategy, but that's putting the horse before the cart.

If you want to implement this as a function, I'd consider this:

CREATE FUNCTION dbo.FindMatches
(
    @TargetString varchar(46)
)
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
(
    SELECT m.PossibleMatch
    FROM dbo.Matches m
    WHERE @TargetString LIKE m.PossibleMatch
);
GO

That code creates a schema-bound table-valued-function that gets "inlined" with the target query, and is about as efficient as you can get for this type of thing.

This would be an example of how you'd call the function:

SELECT s.TargetString
    , fm.PossibleMatch
FROM dbo.Strings s
    CROSS APPLY dbo.FindMatches(s.TargetString) fm
WHERE s.TargetString = 'AMLAA'