SQL Server – Check if a String Contains a Number Exclusively

Tags: pattern-matching, sql-server, sql-server-2005, substring, t-sql

I have been trying to write a function to check whether a string contains a number without that number being part of a bigger number. In other words, if the number being searched for is '6' and the string is '7+16+2', it should return false, because the '6' in this string is part of the number '16'.

I wrote the function below. (It is lengthy, but I was going to test it first before refactoring it.)

Upon testing, I found a bug: it only runs the first found instance of the number through the logic. So running this function with '6' against '16+7+9+6' returns false, because it determines that the first '6' is part of a bigger number and stops processing.

I figured that to get around this I have to implement a loop to shorten the 'haystack' string (so that, using the example '16+7+9+6', the function continues checking '+7+9+6' after eliminating the first '6'). But before spending the time to make an already convoluted function even more convoluted, I wanted to check whether there is a simpler way to achieve the same goal.

drop function dbo.runners_contain_runner
go
create function dbo.runners_contain_runner(@runner varchar(max), @runners varchar(max))
returns int
as 
begin

    /*
    eliminate the plus sign from @runners so that the 
    'isnumeric' function doesn't return false positives (it returns 1 for '+') 
    */
    set @runners = replace(@runners,'+','_' ) 


    declare @ret int;
    set @ret = 0;

    -- if the runner is the only runner return 1
    if @runners = @runner 
        set @ret = 1
    else
    begin
        declare @charindex int;
        set @charindex = charindex(@runner,@runners)
        if @charindex > 0
        begin

            -- if it is at the beginning then check the char after it 
            if @charindex = 1 
            begin
                if isnumeric(substring(@runners,@charindex + len(@runner),1)) = 0
                    set @ret = @charindex
            end

            -- if it is at the end then check the char before it
            else if @charindex = len(@runners) - (len(@runner) - 1) 
            begin
                if isnumeric(substring(@runners,@charindex - 1,1)) = 0
                    set @ret = @charindex
            end

            -- if it is in the middle check the chars either side of it
            else
            begin
                if isnumeric(substring(@runners,@charindex - 1,1)) + 
                isnumeric(substring(@runners,@charindex + len(@runner),1)) = 0
                    set @ret = @charindex 
            end
        end
    end
    return @ret
end

Best Answer

Perhaps you are over-complicating this by focusing too much on wanting a number. Take a step back for a moment. What you actually want is a substring without a digit immediately before or after it. The only way a number could be part of a larger number is to have a digit directly adjacent to it on at least one side, right? So as long as you only pass in numbers, this definition will still find only numbers that do not have a digit on either side.

With that in mind, we just need 3 PATINDEX predicates to cover the passed-in value being on the far left, on the far right, or in the middle. Try the following as it seems to work:

GO
CREATE PROCEDURE #TestFindRunner
(
  @Runner VARCHAR(10)
)
AS
SET NOCOUNT ON;

DECLARE @Data TABLE
(
  [ID] INT NOT NULL PRIMARY KEY,
  [Runners] VARCHAR(50) NULL
 );

INSERT INTO @Data ([ID], [Runners]) VALUES (1, '16+7+9+6');
INSERT INTO @Data ([ID], [Runners]) VALUES (2, '16+7+9+5');
INSERT INTO @Data ([ID], [Runners]) VALUES (3, '26+77+9+5');
INSERT INTO @Data ([ID], [Runners]) VALUES (4, '6+3+45');
INSERT INTO @Data ([ID], [Runners]) VALUES (5, '63,808,111,92');
INSERT INTO @Data ([ID], [Runners]) VALUES (6, '1-7-9,6');
INSERT INTO @Data ([ID], [Runners]) VALUES (7, '1-6-9,7');
INSERT INTO @Data ([ID], [Runners]) VALUES (8, '1-7-9,63');
INSERT INTO @Data ([ID], [Runners]) VALUES (9, '1-63-9,7');
INSERT INTO @Data ([ID], [Runners]) VALUES (10, NULL);
INSERT INTO @Data ([ID], [Runners]) VALUES (11, '6');

SELECT tmp.*
FROM   @Data tmp
WHERE  @Runner COLLATE Latin1_General_100_BIN2 = tmp.[Runners]
OR     PATINDEX('%[^0123456789]' + @Runner COLLATE Latin1_General_100_BIN2,
                tmp.[Runners]) > 0
OR     PATINDEX(@Runner + '[^0123456789]%' COLLATE Latin1_General_100_BIN2,
                tmp.[Runners]) > 0
OR     PATINDEX('%[^0123456789]' + @Runner + '[^0123456789]%'
                COLLATE Latin1_General_100_BIN2, tmp.[Runners]) > 0
GO

And then test with:

EXEC #TestFindRunner 0;
EXEC #TestFindRunner 2;
EXEC #TestFindRunner 4;
EXEC #TestFindRunner 8;
EXEC #TestFindRunner 11;
-- 0 rows

EXEC #TestFindRunner 3;   -- 4
EXEC #TestFindRunner 77;  -- 3
EXEC #TestFindRunner 111; -- 5
-- 1 row

EXEC #TestFindRunner 5; -- 2 and 3
-- 2 rows

EXEC #TestFindRunner 1; -- 6, 7, 8, and 9
-- 4 rows

EXEC #TestFindRunner 6; -- 1, 4, 6, 7, and 11
-- 5 rows

EXEC #TestFindRunner 7; -- 1, 2, 6, 7, 8, and 9
-- 6 rows

EXEC #TestFindRunner 9; -- 1, 2, 3, 6, 7, 8, and 9
-- 7 rows

The reason for having 3 variations of the PATINDEX is that PATINDEX search patterns are not Regular Expressions (RegEx), contrary to what many people say or think (the same is true of LIKE patterns). PATINDEX and LIKE patterns do not have quantifiers, so it is not possible to specify that the [^0123456789] single-character match should be "0 or more"; it is "one and only one; no more, no less".
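
One commonly used alternative, which is not part of the approach above, is to pad the searched string with a non-digit character on both ends so that only the "middle" pattern is needed. The following is a minimal sketch, assuming that '+' is an acceptable pad character and that the searched string is short enough that adding two characters does not truncate it (the variable names and sample values are only for illustration):

```sql
DECLARE @Runner  VARCHAR(10);
DECLARE @Runners VARCHAR(50);
SET @Runner  = '6';
SET @Runners = '16+7+9+6';

-- Padding both ends with a non-digit means the value being searched for
-- always has one character before it and one after it, so the single
-- "middle" pattern also covers the far-left, far-right, and exact-match cases.
SELECT CONVERT(BIT,
         CASE WHEN PATINDEX('%[^0123456789]' + @Runner + '[^0123456789]%'
                            COLLATE Latin1_General_100_BIN2,
                            '+' + @Runners + '+') > 0
              THEN 1
              ELSE 0
         END) AS [RunnerFound];
```

A NULL searched string stays NULL after concatenation, so the CASE expression falls through to 0, just as in the multi-predicate version.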

Forcing the binary collation (i.e. the COLLATE Latin1_General_100_BIN2 in each predicate) ensures that we are only dealing with these 10 decimal digits and not any other character that might be considered equivalent.
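
To see what the collation changes on a particular instance, something like the following can be run. It is only a sketch: the candidate character (U+00B2, superscript two) and the non-binary collation Latin1_General_100_CI_AI are arbitrary choices made for illustration, and the result of the non-binary comparison depends on the collation's rules. The binary comparison works on code points, so the superscript character is not expected to match any of the ten digits there.

```sql
DECLARE @Candidate NVARCHAR(1);
SET @Candidate = NCHAR(178); -- U+00B2, SUPERSCRIPT TWO (illustrative choice)

SELECT
  -- Linguistic (non-binary) comparison: result depends on the collation's rules.
  CASE WHEN @Candidate LIKE N'[0123456789]' COLLATE Latin1_General_100_CI_AI
       THEN 1 ELSE 0 END AS [MatchesNonBinary],
  -- Binary comparison: characters are compared by code point only.
  CASE WHEN @Candidate LIKE N'[0123456789]' COLLATE Latin1_General_100_BIN2
       THEN 1 ELSE 0 END AS [MatchesBinary];
```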

To put the above logic into an Inline Table-Valued Function (TVF) so that it is easier to use (and more efficient than a similarly easy-to-use scalar UDF), try the following:

USE [tempdb];
GO
CREATE FUNCTION dbo.IsRunnerPresent
(
  @Runner VARCHAR(10),
  @Runners VARCHAR(8000)
)
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN

  SELECT CONVERT(BIT,
    CASE WHEN @Runner COLLATE Latin1_General_100_BIN2 = @Runners
           OR PATINDEX('%[^0123456789]' + @Runner
                       COLLATE Latin1_General_100_BIN2, @Runners) > 0
           OR PATINDEX(@Runner + '[^0123456789]%'
                       COLLATE Latin1_General_100_BIN2, @Runners) > 0
           OR PATINDEX('%[^0123456789]' + @Runner + '[^0123456789]%'
                       COLLATE Latin1_General_100_BIN2, @Runners) > 0
           THEN 1
         ELSE 0
      END) AS [RunnerFound];
GO

And then test with:

DECLARE @Runner VARCHAR(10);
SET @Runner = '6';


DECLARE @Data TABLE
(
  [ID] INT NOT NULL PRIMARY KEY,
  [Runners] VARCHAR(50) NULL
 );

INSERT INTO @Data ([ID], [Runners]) VALUES (1, '16+7+9+6');
INSERT INTO @Data ([ID], [Runners]) VALUES (2, '16+7+9+5');
INSERT INTO @Data ([ID], [Runners]) VALUES (3, '26+77+9+5');
INSERT INTO @Data ([ID], [Runners]) VALUES (4, '6+3+45');
INSERT INTO @Data ([ID], [Runners]) VALUES (5, '63,808,111,92');
INSERT INTO @Data ([ID], [Runners]) VALUES (6, '1-7-9,6');
INSERT INTO @Data ([ID], [Runners]) VALUES (7, '1-6-9,7');
INSERT INTO @Data ([ID], [Runners]) VALUES (8, '1-7-9,63');
INSERT INTO @Data ([ID], [Runners]) VALUES (9, '1-63-9,7');
INSERT INTO @Data ([ID], [Runners]) VALUES (10, NULL);
INSERT INTO @Data ([ID], [Runners]) VALUES (11, '6');

SELECT tmp.[ID],
       tmp.[Runners],
       fnd.[RunnerFound]
FROM   @Data tmp
CROSS APPLY dbo.IsRunnerPresent(@Runner, tmp.[Runners]) fnd;

Which returns:

ID    Runners          RunnerFound
1     16+7+9+6         1
2     16+7+9+5         0
3     26+77+9+5        0
4     6+3+45           1
5     63,808,111,92    0
6     1-7-9,6          1
7     1-6-9,7          1
8     1-7-9,63         0
9     1-63-9,7         0
10    NULL             0
11    6                1
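
Because the function returns a single-row table, it can also be used to filter rows rather than only to report a flag. The following is a minimal sketch, assuming the dbo.IsRunnerPresent function defined above exists in the current database; the @Races table variable is a hypothetical name used only for illustration:

```sql
DECLARE @Races TABLE
(
  [ID] INT NOT NULL PRIMARY KEY,
  [Runners] VARCHAR(50) NULL
);

INSERT INTO @Races ([ID], [Runners]) VALUES (1, '16+7+9+6');
INSERT INTO @Races ([ID], [Runners]) VALUES (2, '16+7+9+5');

-- Keep only the rows in which '6' appears as a whole number.
SELECT r.[ID],
       r.[Runners]
FROM   @Races r
CROSS APPLY dbo.IsRunnerPresent('6', r.[Runners]) fnd
WHERE  fnd.[RunnerFound] = 1;
```

With these two sample rows, only ID 1 qualifies, since the only '6' in ID 2 is part of '16'.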
