Efficient Pattern Matching for Single Word in SQL Server

pattern matchingsql serversql-server-2016string manipulation

I'm creating a function to find out what search pattern would best fit a single string. The string to match will be compared against a table of what could be possibly hundreds of possible patterns to see if it matches any of them.

I've got a not-so-elegant function working right now that will read all patterns into a memory table and iterate though them to see if each matches. It works but makes me sick to my stomach to think of using this in a production environment, especially if the pattern table grows to thousands of patterns.

Example string to match: 'Master'

Possible patterns:

m%r
abc%
xyz%
a%bcd

The outcome would be a match to pattern #1.

In the scenario, I need to be able to match potential part numbers, brands, and other possibly misspelled phrases. I'm incorporating Double Metaphone but when it comes to other items that only a pattern will catch, I need to know what pattern was the match among many. The "StringToSearch" would only return 1 matched pattern.

Does anyone have theory or other code types that can help me with this reverse pattern matching process?

Best Answer

Without knowing anything about what your actual scenario looks like, I'd perhaps do something like:

USE tempdb;

IF OBJECT_ID(N'dbo.Strings', N'U') IS NOT NULL
DROP TABLE dbo.Strings;
CREATE TABLE dbo.Strings
(
    TargetString varchar(46) NOT NULL
);

IF OBJECT_ID(N'dbo.Matches', N'U') IS NOT NULL
DROP TABLE dbo.Matches;
CREATE TABLE dbo.Matches
(
    PossibleMatch varchar(46) NOT NULL
);

INSERT INTO dbo.Strings(TargetString)
VALUES ('Master');

INSERT INTO dbo.Matches(PossibleMatch)
VALUES ('m%r')
    , ('abc%')
    , ('zyx%')
    , ('a%bcd')

SELECT *
FROM dbo.Strings s
    INNER JOIN dbo.Matches m ON s.TargetString LIKE m.PossibleMatch

╔══════════════╦═══════════════╗
║ TargetString ║ PossibleMatch ║
╠══════════════╬═══════════════╣
║ Master       ║ m%r           ║
╚══════════════╩═══════════════╝

You'll get a better set of answers if you more accurately describe exactly what your scenario is by explaining how the data is designed, and what you exact design goals are.

To see if this methodology is effective enough, I've extended the MCVE above by adding a couple of indexes and a bunch of data:

The tables now have clustered indexes:

IF OBJECT_ID(N'dbo.Strings', N'U') IS NOT NULL
DROP TABLE dbo.Strings;
CREATE TABLE dbo.Strings
(
    TargetString varchar(46) NOT NULL
        PRIMARY KEY CLUSTERED
);

IF OBJECT_ID(N'dbo.Matches', N'U') IS NOT NULL
DROP TABLE dbo.Matches;
CREATE TABLE dbo.Matches
(
    PossibleMatch varchar(46) NOT NULL
        PRIMARY KEY CLUSTERED
);

This inserts 50,000 words into the Strings table:

;WITH src AS (
    SELECT v.val
    FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9))v(val)
)
, src26 AS 
(
    SELECT TOP(26) Num = (v.val * 10) + v1.val
    FROM src v
        CROSS JOIN src v1
    ORDER BY v.val, v1.val
)
, words AS
(
    SELECT c1 = CHAR(65 + s1.num)
        , c2 = CHAR(65 + s2.num)
        , c3 = CHAR(65 + s3.num)
        , c4 = CHAR(65 + s4.num)
        , c5 = CHAR(65 + s5.num)
    FROM src26 s5
        CROSS JOIN src26 s4
        CROSS JOIN src26 s3
        CROSS JOIN src26 s2
        CROSS JOIN src26 s1
)
INSERT INTO dbo.Strings(TargetString)
SELECT TOP(50000) words.c1
    + words.c2
    + words.c3
    + words.c4
    + words.c5
FROM words
ORDER BY CRYPT_GEN_RANDOM(5);

A sample of the rows from that table:

╔══════════════╗
║ TargetString ║
╠══════════════╣
║ UKMBL        ║
║ ASOCG        ║
║ XWACG        ║
║ NHRXQ        ║
║ LFMBR        ║
║ EQELO        ║
║ PRHCM        ║
║ IIFPD        ║
║ OCLJQ        ║
║ NJBMB        ║
╚══════════════╝

Next, we'll create 1,000,000 rows in the Matches table with wildcards:

;WITH src AS (
    SELECT v.val
    FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9))v(val)
)
, src26 AS 
(
    SELECT TOP(26) Num = (v.val * 10) + v1.val
    FROM src v
        CROSS JOIN src v1
    ORDER BY v.val, v1.val
)
, words AS
(
    SELECT c1 = CHAR(65 + s1.num)
        , c2 = CHAR(65 + s2.num)
        , c3 = CHAR(65 + s3.num)
        , c4 = CHAR(65 + s4.num)
        , c5 = CHAR(65 + s5.num)
    FROM src26 s5
        CROSS JOIN src26 s4
        CROSS JOIN src26 s3
        CROSS JOIN src26 s2
        CROSS JOIN src26 s1
)
INSERT INTO dbo.Matches(PossibleMatch)
SELECT TOP(1000000) REPLACE(words.c1, 'A', '_')
    + REPLACE(words.c2, 'E', '_')
    + REPLACE(words.c3, 'I', '_')
    + REPLACE(words.c4, 'O', '_')
    + REPLACE(words.c5, 'U', '%')
FROM words
WHERE NOT (words.c1 = 'A' AND words.c2 = 'E' AND words.c3 = 'I' AND words.c4 ='O' AND words.c5 <> 'U');

SELECT TotalMatches = (SELECT COUNT(1) FROM dbo.Matches)
    , TotalStrings = (SELECT COUNT(1) FROM dbo.Strings)

Here's a sample of those rows:

╔═══════════════╗
║ PossibleMatch ║
╠═══════════════╣
║ LDBFB         ║
║ IWLMB         ║
║ FNBUA         ║
║ _ZEVC         ║
║ BSZDB         ║
║ TTRIB         ║
║ XVLNA         ║
║ VLPDB         ║
║ UI_IA         ║
║ FNLCB         ║
╚═══════════════╝

As you can see from the sample above, some of the rows have a T-SQL single-character wildcard, _, and some rows don't have wildcards.

Now, we'll run some code to see if we can find a row from Matches that matches up to a row in Strings:

SELECT *
FROM dbo.Strings s
    LEFT JOIN dbo.Matches m ON s.TargetString LIKE m.PossibleMatch
WHERE s.TargetString = (
    SELECT TOP(1) Strings.TargetString
    FROM dbo.Strings 
    ORDER BY CRYPT_GEN_RANDOM(5)
    )

The WHERE clause randomly chooses a word from the Strings table. You may need to run this several times before you get a word with matching values in the Matches table.

Looking up any given word in the Strings table, with associated Matches:

SELECT *
FROM dbo.Strings s
    LEFT JOIN dbo.Matches m ON s.TargetString LIKE m.PossibleMatch
WHERE s.TargetString = 'JRZPC'

The plan for this query:

On my system, this query executes in ~100 milliseconds.

The results of that query:

╔══════════════╦═══════════════╗
║ TargetString ║ PossibleMatch ║
╠══════════════╬═══════════════╣
║ JRZPC        ║ ___PC         ║
║ JRZPC        ║ __Z_C         ║
║ JRZPC        ║ __ZPC         ║
║ JRZPC        ║ _R__C         ║
║ JRZPC        ║ _R_PC         ║
║ JRZPC        ║ _RZ_C         ║
║ JRZPC        ║ _RZPC         ║
╚══════════════╩═══════════════╝

If you were going to run hundreds or thousands of these queries per minute, you might want to re-think the strategy, but that's putting the horse before the cart.

If you want to implement this as a function, I'd consider this:

CREATE FUNCTION dbo.FindMatches
(
    @TargetString varchar(46)
)
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
(
    SELECT m.PossibleMatch
    FROM dbo.Matches m
    WHERE @TargetString LIKE m.PossibleMatch
);
GO

That code creates a schema-bound table-valued-function that gets "inlined" with the target query, and is about as efficient as you can get for this type of thing.

This would be an example of how you'd call the function:

SELECT s.TargetString
    , fm.PossibleMatch
FROM dbo.Strings s
    CROSS APPLY dbo.FindMatches(s.TargetString) fm
WHERE s.TargetString = 'AMLAA'

Related Solutions

MySQL select data over time that matches pattern over several rows

I do something similar on my side, and I use a MySQL Stored Procedure / Cursor.

You loop through each line of the table and record the current datetime entry and plant_status into variables, and then move on to the next line.

At which point you compare the current datetime and plant_status with the last ones.

If it matches a set condition (in your case new_status != last_status) you add an entry to a temporary table.

And then at the end you simply select * from temporary_table_name to show you all the rows where the status changed.

You could easily convert this to check the datetime as well (changing the old_datetime variable each time the status changes), and comparing that to the current one to find out if there is a two hour difference). (this probably should be a comment rather than an answer but I don't yet have enough points to comment - sorry)

Sql-server – Find and return items based on matching strings across 17 tables

Is the application going to search for these one by one? Is the performance of this query important? Easy (ish) solution would be to create a stored procedure that accepted some arguments and then returned the results from searching each table.

Results from the stored procedure could include the table it found a match on, the primary key and the field that matched. Then the caller of that stored procedure could do more work based on that result set.

EDIT - Below is a very basic sample procedure for you.

CREATE PROCEDURE dbo.sp_FindCircuit
    (
    @Search NVARCHAR(100)
    )
AS
BEGIN

--Hold Results.
DECLARE @Results TABLE
    (
    TableName VARCHAR(100) NULL
    , FieldName VARCHAR(100) NULL
    , KeyValue INT NULL
    )

INSERT INTO @Results
        (TableName, FieldName, KeyValue)
SELECT 'Table 5'
    , 'Circuit ID'
    , <PrimaryKey>
FROM [Table 5]
WHERE CircuitID LIKE ('%' + @Search + '%')

--Repeat for each table.


--Return data.
SELECT TableName, FieldName, KeyValue
FROM @Results 

END

Best Answer

Related Solutions

MySQL select data over time that matches pattern over several rows

Sql-server – Find and return items based on matching strings across 17 tables

Related Question