T-SQL Equivalent for Punctuation Similar to [0-9] for Numbers and [a-z] for Letters

regular expressionsql serversql-server-2012t-sql

Is there a T-SQL equivalent of the [0-9] and [a-z] patterns that will let me pull values from a column that contains punctuation?

For example:

Create Table #Test
(
Value   VarChar(10)
) 
Insert Into #Test
Values ('123a'), ('456b'), ('12ABC'),('AB!23'),('C?D789')

Select      *
From        #Test
Where       Value like '[0-9][0-9][0-9][a-z]'

This would return values where the first 3 characters are numbers between 0 and 9 and the last character will be a letter between a and z, so would return things like 123a and 456b but wouldn't return a value of 12ABC.

I want to know if there is an equivalent for punctuation as [0-9] is for numbers and [a-z] is for letters so that it would return AB!23 and C?D789?

If I could use a regular expression I might use the expression ^[a-zA-Z0-9]*$ to match alphanumeric characters in a string.

Where       Value like '^[a-zA-Z0-9]*$'

Is there a SQL equivalent for this?

I know this kind of thing that can be done in RegEx but I need it in T-SQL, I am not able to load any custom assemblies onto this server so can't use regular expressions.

The real column is varchar(200). The collation is Latin1_General_CI_AS. I am using SQL Server 2012 Standard Edition.

Best Answer

The biggest difficulty in coming to a precise solution is in defining exactly what characters are to be included (or excluded, whichever direction makes more sense for the operation). Meaning:

Are we talking about VARCHAR / ASCII data or NVARCHAR / Unicode data? The list of punctuation characters for ASCII data depends on the Code Page which in turn depends on the Collation. (in this Question we are dealing with ASCII data).
Are we dealing with case-sensitive or case-insensitive searches?
What Collation is the column set to? The Collation will tell us both the Code Page and case-sensitivity. (in this question we are dealing with Latin1_General_CI_AS)
is the term "punctuation" to mean just standard punctuation characters (e.g. ., ,, ;, :, etc) or does it mean non-alphanumeric characters?
Are whitespace characters included?
Are Control characters included?
What about currency symbols such as ¢, £, ¥, etc?
What about symbols such as © and ™?
What characters are considered "alpha"? Are non-English characters such as Â, É, Ñ, ß, Þ included?
Since this Question deals with UK keyboards (see discussion for this Question), what about the Æ / æ character?

In order to help facilitate clarity regarding expected behavior, the following query will show all 256 characters of the Latin1 character set (i.e. Code Page 1252) and how two variations of @Shaneis's proposed solution operate. The first field (labeled as Latin1_General_CI_AS) shows the LIKE clause as proposed by @Shaneis (as of this writing) and the second field (labeled as Latin1_General_100_BIN2) shows a modification where I overrode the Collation to specify a binary one (i.e. a Collation ending in _BIN2; the _BIN Collations are deprecated so don't use them if you have access to the _BIN2 versions) which meant I also needed to add in the A-Z range to filter out upper-case letters as the current Collation is case-insensitive:

;WITH nums AS
(
  SELECT TOP (256) (ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1) AS [Decimal]
  FROM   [master].[sys].[all_objects]
)
SELECT nm.[Decimal],
       CHAR(nm.[Decimal]) AS [Character],
       CASE WHEN CHAR(nm.[Decimal]) LIKE '%[^a-z0-9]%'
               THEN 'x' ELSE '' END AS [Latin1_General_CI_AS],
       CASE WHEN CHAR(nm.[Decimal]) LIKE '%[^a-z0-9A-Z]%' COLLATE Latin1_General_100_BIN2
               THEN 'x' ELSE '' END AS [Latin1_General_100_BIN2]
FROM   nums nm;

UPDATE

It should be mentioned that IF one is truly seeking to find characters that are classified as being "punctuation" (and not "currency symbol", "mathematical symbol", etc), and IF one is not prohibited from using SQLCLR / loading a custom Assembly (SQLCLR was introduced with SQL Server 2005, and I have yet to come across a good reason for not allowing it, especially since Azure SQL Database V12 supports SAFE Assemblies), then you can use Regular Expressions, but not for the reason that most people would guess.

Rather than using Regular Expressions to build a more functional character range, or even rather than using something like \w (meaning any "word" character), you can specify the Unicode Category of the characters you want to filter on, and there are several defined categories:

https://www.regular-expressions.info/unicode.html#category

You can even specify the Unicode Block to filter on, such as "InBengali" or "InDingbats" or "InOptical_Character_Recognition", etc:

https://www.regular-expressions.info/unicode.html#block

There are numerous examples of creating RegEx functions for SQL Server (though most examples don't follow SQLCLR best practices), or you can download the Free version of the SQL# library (which I created), and use the scalar RegEx_IsMatch function as follows:

SQL#.RegEx_IsMatch(Unicode-String-Expression, N'\p{P}', 1, NULL)

The \p{P} expression means \p = Unicode Category, and {P} = all punctuation (as opposed to a specific type of punctuation, such as "Connector Punctuation"). AND, the "Punctuation" category includes all punctuation across all languages! You can see the full list on the Unicode.org site via the following link (there are currently 717 Code Points in that category):

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AGeneral_Category%3DPunctuation%3A%5D

An updated version of the test query shown above, including another field that uses SQL#.RegEx_IsMatch with \p{P}, and the results of all 3 tests across all 256 characters of Code Page 1252 (i.e. Latin1_General) has been posted on PasteBin.com at:

T-SQL query and results for filtering types of characters

UPDATE
The following was mentioned in the related discussion:

You've made a good point about accented characters, with them being hotel names from around the world there will be accented characters in the names, for my problem I would want to class these as valid alpha characters.

In this case:

There are 11 non-English characters that are included in the Latin1 character set / Code Page that are not matched by the a-z range. They are: ð Ð Þ þ œ Œ š Š ž Ž Ÿ. These need to be added to the wildcard, and while not necessary at the moment, it wouldn't hurt to add in A-Z so that the pattern works just as well on a case-sensitive collation. The end result is:
LIKE '%[^a-zA-Z0-9ðÐÞþœŒšŠžŽŸ]%'
Considering that this data can include "hotel names from around the world", I would highly recommend changing the datatype of the column to be NVARCHAR so that you can store all characters from all languages. Keeping this as VARCHAR runs a very high risk of eventually having data loss since you can only represent the Latin-based languages, and not even fully for those given the six supplemental Unicode categories that provide additional Latin-related characters.

Best Answer

Related Solutions

MySQL – REGEXP for Text Not Between Parentheses

Mysql – How to store a 64-digit key of alpha numerical values into MySQL Database

Related Question