Is there a T-SQL equivalent of the [0-9]
and [a-z]
patterns that will let me pull values from a column that contains punctuation?
For example:
Create Table #Test
(
Value VarChar(10)
)
Insert Into #Test
Values ('123a'), ('456b'), ('12ABC'),('AB!23'),('C?D789')
Select *
From #Test
Where Value like '[0-9][0-9][0-9][a-z]'
This would return values where the first 3 characters are numbers between 0 and 9 and the last character will be a letter between a and z, so would return things like 123a
and 456b
but wouldn't return a value of 12ABC
.
I want to know if there is an equivalent for punctuation as [0-9]
is for numbers and [a-z]
is for letters so that it would return AB!23
and C?D789
?
If I could use a regular expression I might use the expression ^[a-zA-Z0-9]*$
to match alphanumeric characters in a string.
Where Value like '^[a-zA-Z0-9]*$'
Is there a SQL equivalent for this?
I know this kind of thing that can be done in RegEx but I need it in T-SQL, I am not able to load any custom assemblies onto this server so can't use regular expressions.
The real column is varchar(200). The collation is Latin1_General_CI_AS. I am using SQL Server 2012 Standard Edition.
Best Answer
The biggest difficulty in coming to a precise solution is in defining exactly what characters are to be included (or excluded, whichever direction makes more sense for the operation). Meaning:
VARCHAR
/ ASCII data orNVARCHAR
/ Unicode data? The list of punctuation characters for ASCII data depends on the Code Page which in turn depends on the Collation. (in this Question we are dealing with ASCII data).Latin1_General_CI_AS
).
,,
,;
,:
, etc) or does it mean non-alphanumeric characters?¢
,£
,¥
, etc?©
and™
?Â
,É
,Ñ
,ß
,Þ
included?Æ
/æ
character?In order to help facilitate clarity regarding expected behavior, the following query will show all 256 characters of the Latin1 character set (i.e. Code Page 1252) and how two variations of @Shaneis's proposed solution operate. The first field (labeled as
Latin1_General_CI_AS
) shows theLIKE
clause as proposed by @Shaneis (as of this writing) and the second field (labeled asLatin1_General_100_BIN2
) shows a modification where I overrode the Collation to specify a binary one (i.e. a Collation ending in_BIN2
; the_BIN
Collations are deprecated so don't use them if you have access to the_BIN2
versions) which meant I also needed to add in theA-Z
range to filter out upper-case letters as the current Collation is case-insensitive:UPDATE
It should be mentioned that IF one is truly seeking to find characters that are classified as being "punctuation" (and not "currency symbol", "mathematical symbol", etc), and IF one is not prohibited from using SQLCLR / loading a custom Assembly (SQLCLR was introduced with SQL Server 2005, and I have yet to come across a good reason for not allowing it, especially since Azure SQL Database V12 supports
SAFE
Assemblies), then you can use Regular Expressions, but not for the reason that most people would guess.Rather than using Regular Expressions to build a more functional character range, or even rather than using something like
\w
(meaning any "word" character), you can specify the Unicode Category of the characters you want to filter on, and there are several defined categories:https://www.regular-expressions.info/unicode.html#category
You can even specify the Unicode Block to filter on, such as "InBengali" or "InDingbats" or "InOptical_Character_Recognition", etc:
https://www.regular-expressions.info/unicode.html#block
There are numerous examples of creating RegEx functions for SQL Server (though most examples don't follow SQLCLR best practices), or you can download the Free version of the SQL# library (which I created), and use the scalar RegEx_IsMatch function as follows:
The
\p{P}
expression means\p
= Unicode Category, and{P}
= all punctuation (as opposed to a specific type of punctuation, such as "Connector Punctuation"). AND, the "Punctuation" category includes all punctuation across all languages! You can see the full list on the Unicode.org site via the following link (there are currently 717 Code Points in that category):http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AGeneral_Category%3DPunctuation%3A%5D
An updated version of the test query shown above, including another field that uses SQL#.RegEx_IsMatch with
\p{P}
, and the results of all 3 tests across all 256 characters of Code Page 1252 (i.e. Latin1_General) has been posted on PasteBin.com at:T-SQL query and results for filtering types of characters
UPDATE
The following was mentioned in the related discussion:
In this case:
There are 11 non-English characters that are included in the Latin1 character set / Code Page that are not matched by the
a-z
range. They are:ð Ð Þ þ œ Œ š Š ž Ž Ÿ
. These need to be added to the wildcard, and while not necessary at the moment, it wouldn't hurt to add inA-Z
so that the pattern works just as well on a case-sensitive collation. The end result is:LIKE '%[^a-zA-Z0-9ðÐÞþœŒšŠžŽŸ]%'
Considering that this data can include "hotel names from around the world", I would highly recommend changing the datatype of the column to be
NVARCHAR
so that you can store all characters from all languages. Keeping this asVARCHAR
runs a very high risk of eventually having data loss since you can only represent the Latin-based languages, and not even fully for those given the six supplemental Unicode categories that provide additional Latin-related characters.