SQL Server – WHERE Clause Pattern Matching for Rows with Characters Not in List

sql serversql-server-2008

I am in a SQL Server 2008 environment.

I am trying to use pattern matching in the WHERE clause to find rows where a certain column's value contains characters that are not alpha-numeric, underscore, dash, period, or space.

This is my code, and sample data, but I am not getting the results expected.

In the example data, I want to return rows 7, 8, 9, and 12, but I am getting rows 5 and 6.

If this isn't the best way to achieve the goal, I am open to hearing other methods.

I am not in an environment where I can implement regex so my solution limited to out-of-the-box functionality.

CREATE TABLE PATTERN_TEST
(
ID INT NOT NULL,
STRING NVARCHAR(40) NOT NULL
)

INSERT INTO PATTERN_TEST
SELECT 1, 'string' UNION 
SELECT 2, 'STRING' UNION 
SELECT 3, 'string space' UNION 
SELECT 4, 'STRING SPACE' UNION 
SELECT 5, 'string-dash' UNION 
SELECT 6, 'string-dash space' UNION 
SELECT 7, 'string "otherchar"' UNION 
SELECT 8, 'string "other char"' UNION 
SELECT 9, '"string"' UNION 
SELECT 10, 'string_underscore' UNION 
SELECT 11, 'string_underscore space' UNION
SELECT 12, '"'
;

SELECT * FROM PATTERN_TEST WHERE STRING LIKE '%[^a-zA-Z0-9_ -.]%';

Best Answer

This will work as well:

SELECT * 
FROM PATTERN_TEST 
WHERE STRING LIKE '%[^a-zA-Z0-9_ .-]%' ;

Tested at rextester.com

The only difference is that dash (-) is put at the end of the [...] pattern. It has to do with the special meaning of the dash character (A-Z is interpreted as "any character from A to Z").

In your pattern the last three characters (space-dash-dot: -.) are interpreted as "any character from space () to dot (.)", and the results are not the wanted/expected.

You could also escape the dash:

WHERE STRING LIKE '%[^a-zA-Z0-9_ $-.]%' ESCAPE '$' ;

Note though that using ESCAPE may adversely affect cardinality estimation.

See the MSDN documentation for some details: LIKE:

If there is no character after an escape character in the LIKE pattern, the pattern is not valid and the LIKE returns FALSE. If the character after an escape character is not a wildcard character, the escape character is discarded and the character following the escape is treated as a regular character in the pattern. This includes the percent sign (%), underscore (_), and left bracket ([) wildcard characters when they are enclosed in double brackets ([ ]). Also, within the double bracket characters ([ ]), escape characters can be used and the caret (^), hyphen (-), and right bracket (]) can be escaped.

To be clear: special characters need escaping when they are used as non-special. Inside [], the three (^, -, ]) are special. Putting the dash at the end is more like a hack.