[0-9]
is not some type of regular expression defined to just match digits.
Any range in a LIKE
pattern matches characters between the start and end character according to collation sort order.
SELECT CodePoint,
Symbol,
RANK() OVER (ORDER BY Symbol COLLATE Latin1_General_CI_AS) AS Rnk
FROM #CodePage
WHERE Symbol LIKE '[0-9]' COLLATE Latin1_General_CI_AS
ORDER BY Symbol COLLATE Latin1_General_CI_AS
Returns
CodePoint Symbol Rnk
-------------------- ------ --------------------
48 0 1
188 ¼ 2
189 ½ 3
190 ¾ 4
185 ¹ 5
49 1 5
50 2 7
178 ² 7
179 ³ 9
51 3 9
52 4 11
53 5 12
54 6 13
55 7 14
56 8 15
57 9 16
So you get these results because under your default collation these characters sort after 0
but before 9
.
It looks as though the collation is defined to actually sort them in mathematical order with the fractions in the correct order between 0
and 1
.
You could also use a set rather than a range. To avoid 2
matching ²
you would need a CS
collation
SELECT CodePoint, Symbol
FROM #CodePage
WHERE Symbol LIKE '[0123456789]' COLLATE Latin1_General_CS_AS
The biggest difficulty in coming to a precise solution is in defining exactly what characters are to be included (or excluded, whichever direction makes more sense for the operation). Meaning:
- Are we talking about
VARCHAR
/ ASCII data or NVARCHAR
/ Unicode data? The list of punctuation characters for ASCII data depends on the Code Page which in turn depends on the Collation. (in this Question we are dealing with ASCII data).
- Are we dealing with case-sensitive or case-insensitive searches?
- What Collation is the column set to? The Collation will tell us both the Code Page and case-sensitivity. (in this question we are dealing with
Latin1_General_CI_AS
)
- is the term "punctuation" to mean just standard punctuation characters (e.g.
.
, ,
, ;
, :
, etc) or does it mean non-alphanumeric characters?
- Are whitespace characters included?
- Are Control characters included?
- What about currency symbols such as
¢
, £
, ¥
, etc?
- What about symbols such as
©
and ™
?
- What characters are considered "alpha"? Are non-English characters such as
Â
, É
, Ñ
, ß
, Þ
included?
- Since this Question deals with UK keyboards (see discussion for this Question), what about the
Æ
/ æ
character?
In order to help facilitate clarity regarding expected behavior, the following query will show all 256 characters of the Latin1 character set (i.e. Code Page 1252) and how two variations of @Shaneis's proposed solution operate. The first field (labeled as Latin1_General_CI_AS
) shows the LIKE
clause as proposed by @Shaneis (as of this writing) and the second field (labeled as Latin1_General_100_BIN2
) shows a modification where I overrode the Collation to specify a binary one (i.e. a Collation ending in _BIN2
; the _BIN
Collations are deprecated so don't use them if you have access to the _BIN2
versions) which meant I also needed to add in the A-Z
range to filter out upper-case letters as the current Collation is case-insensitive:
;WITH nums AS
(
SELECT TOP (256) (ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1) AS [Decimal]
FROM [master].[sys].[all_objects]
)
SELECT nm.[Decimal],
CHAR(nm.[Decimal]) AS [Character],
CASE WHEN CHAR(nm.[Decimal]) LIKE '%[^a-z0-9]%'
THEN 'x' ELSE '' END AS [Latin1_General_CI_AS],
CASE WHEN CHAR(nm.[Decimal]) LIKE '%[^a-z0-9A-Z]%' COLLATE Latin1_General_100_BIN2
THEN 'x' ELSE '' END AS [Latin1_General_100_BIN2]
FROM nums nm;
UPDATE
It should be mentioned that IF one is truly seeking to find characters that are classified as being "punctuation" (and not "currency symbol", "mathematical symbol", etc), and IF one is not prohibited from using SQLCLR / loading a custom Assembly (SQLCLR was introduced with SQL Server 2005, and I have yet to come across a good reason for not allowing it, especially since Azure SQL Database V12 supports SAFE
Assemblies), then you can use Regular Expressions, but not for the reason that most people would guess.
Rather than using Regular Expressions to build a more functional character range, or even rather than using something like \w
(meaning any "word" character), you can specify the Unicode Category of the characters you want to filter on, and there are several defined categories:
https://www.regular-expressions.info/unicode.html#category
You can even specify the Unicode Block to filter on, such as "InBengali" or "InDingbats" or "InOptical_Character_Recognition", etc:
https://www.regular-expressions.info/unicode.html#block
There are numerous examples of creating RegEx functions for SQL Server (though most examples don't follow SQLCLR best practices), or you can download the Free version of the SQL# library (which I created), and use the scalar RegEx_IsMatch function as follows:
SQL#.RegEx_IsMatch(Unicode-String-Expression, N'\p{P}', 1, NULL)
The \p{P}
expression means \p
= Unicode Category, and {P}
= all punctuation (as opposed to a specific type of punctuation, such as "Connector Punctuation"). AND, the "Punctuation" category includes all punctuation across all languages! You can see the full list on the Unicode.org site via the following link (there are currently 717 Code Points in that category):
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AGeneral_Category%3DPunctuation%3A%5D
An updated version of the test query shown above, including another field that uses SQL#.RegEx_IsMatch with \p{P}
, and the results of all 3 tests across all 256 characters of Code Page 1252 (i.e. Latin1_General) has been posted on PasteBin.com at:
T-SQL query and results for filtering types of characters
UPDATE
The following was mentioned in the related discussion:
You've made a good point about accented characters, with them being hotel names from around the world there will be accented characters in the names, for my problem I would want to class these as valid alpha characters.
In this case:
There are 11 non-English characters that are included in the Latin1 character set / Code Page that are not matched by the a-z
range. They are: ð Ð Þ þ œ Œ š Š ž Ž Ÿ
. These need to be added to the wildcard, and while not necessary at the moment, it wouldn't hurt to add in A-Z
so that the pattern works just as well on a case-sensitive collation. The end result is:
LIKE '%[^a-zA-Z0-9ðÐÞþœŒšŠžŽŸ]%'
Considering that this data can include "hotel names from around the world", I would highly recommend changing the datatype of the column to be NVARCHAR
so that you can store all characters from all languages. Keeping this as VARCHAR
runs a very high risk of eventually having data loss since you can only represent the Latin-based languages, and not even fully for those given the six supplemental Unicode categories that provide additional Latin-related characters.
Best Answer
First you should have a Numbers table.
Use the numbers table to split your strings into rows. Get the matching characters and position using intersect. Use row_number to get only the matching characters from the start and finally rebuild the string with a
for xml
trick or string_agg.