Sql-server – Compare strings from two tables SQL Server 2014

join;sql serversql server 2014string-searching

I'm using these two tables in a SQL Server 2014 database:

CREATE TABLE #FORMULAS 
(
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [DataSourceAddress] [nvarchar](255) NULL,
    [Description] [nvarchar](255) NULL
)

CREATE INDEX [IX_FORMULAS] 
ON #FORMULAS ([DataSourceAddress] ASC)

CREATE TABLE #PARAMETERS
(
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [Datetime] DateTime NULL,
    [ParameterName] [nvarchar](255) NULL,
    [Value] float NULL
)

CREATE INDEX [IX_PARAMETERS] 
ON #PARAMETERS([ParameterName] ASC)

INSERT INTO #FORMULAS ([DataSourceAddress], [Description])
VALUES ('(parameter1+parameter2+parameter3)/3', 'Formula description 1'),
       ('parameter4', 'Formula description 2'),
       ('(parameter5*1000)', 'Formula description 3');

INSERT INTO #PARAMETERS ([Datetime], [ParameterName], [Value])
VALUES ('2016-01-26 11:40:00.000', 'parameter1', 1),
       ('2016-01-26 11:40:00.000', 'parameter2', 2),
       ('2016-01-26 11:40:00.000', 'parameter3', 3),
       ('2016-01-26 11:40:00.000', 'parameter4', 4),
       ('2016-01-26 11:40:00.000', 'parameter5', 5),
       ('2016-01-26 11:40:00.000', 'parameter6', 6),
       ('2016-01-26 11:44:00.000', 'parameter1', 1),
       ('2016-01-26 11:44:00.000', 'parameter2', 2),
       ('2016-01-26 11:44:00.000', 'parameter3', 3),
       ('2016-01-26 11:44:00.000', 'parameter4', 4),
       ('2016-01-26 11:44:00.000', 'parameter5', 5),
       ('2016-01-26 11:44:00.000', 'parameter6', 6);

I want to select the parameters of the table #FORMULAS that match with the column [ParameterName] of the table #PARAMETERS.

For example, take the value '(parameter1+parameter2+parameter3)/3' of the table #FORMULAS and receive all the rows of the table #PARAMETERS that contains 'parameter1','parameter2','parameter3'.

I'm using the following select:

SELECT  
    A.[Datetime],
    A.[ParameterName],
    B.[Description],
    A.[Value]
FROM   
    #PARAMETERS A
JOIN 
    #FORMULAS B ON CHARINDEX(A.[ParameterName], B.[DataSourceAddress]) > 0

The #PARAMETERS table has more than 160000 rows (with 1174 unique ParameterNames), #FORMULAS has 3283 rows. The query takes a lot of time to finish, more than 10 minutes.

Is there a faster way to do the query?

Best Answer

Wildcard matching strings with good performance will always be tricky, but here's a best-effort attempt.

First off, your indexing can be improved. The index on #PARAMETERS only includes the actual ParameterName column, which is fine if you need a list of parameters, but whenever you need any other column, the index will be useless to SQL Server, and it'll revert to scanning the entire table instead. I would propose fixing this by changing the index to a clustered index:

CREATE UNIQUE CLUSTERED INDEX [IX_PARAMETERS]
    ON #PARAMETERS (ParameterName, [Datetime])
    WITH (DROP_EXISTING=ON);     -- Remove this row if the index doesn't already exist.

Clustered indexes include every column in the table, which automatically makes a clustered index a covering indexes.

From here, I've updated your query to do three things:

  • Collect each unique parameter in #PARAMETERS.
  • For each unique parameter, find matching formulas in #FORMULAS. This is the expensive part, and no index will really improve this search because the search condition (LIKE) is not sargable. This means that, index or not, SQL Server will go through each and every row in the table (a so-called scan), for every parameter.
  • Finally, join #PARAMETERS to get all the dates and values. This is a relatively cheap and efficient operation. This is where the modified index on #PARAMETERS helps.

Here's the resulting query:

SELECT  A.[Datetime],
        p.[ParameterName],
        B.[Description],
        A.[Value]
FROM (SELECT DISTINCT ParameterName FROM #PARAMETERS) p
CROSS APPLY (
    SELECT *
    FROM #FORMULAS B WITH (FORCESCAN)
    WHERE B.DataSourceAddress LIKE N'%'+p.ParameterName+N'%') AS B
INNER JOIN #PARAMETERS AS A ON
    p.ParameterName=A.ParameterName;

I added WITH (FORCESCAN) to eliminate key lookups. You can try the query with or without this hint.

If I had a lot more time to work on this and it was a super-critical query, I would probably (a) remodel the tables and/or (b) consider storing the data in #FORMULAS as XML data and add XML indexes, which would eliminate the LIKE match.

Footnote: Text matching like you do here (LIKE and CHARINDEX are pretty much the same here) will include formulas with the parameters "parameter10" and "parameter11" even when you're just searching for "parameter1".