Sql-server – Why are non-digits LIKE [0-9]

collationsql server

My server's default collation is Latin1_General_CI_AS, as determined by this query:

SELECT SERVERPROPERTY('Collation') AS Collation;

I was surprised to discover that with this collation I can match non-digit characters in strings using the predicate LIKE '[0-9]'.

Why in the default collation does this happen? I can't think of a case where this would be useful. I know I can work around the behavior using a binary collation, but it seems like a strange way to implement the default collation.

Filtering digits produces non-digit caracters

I can demonstrate the behavior by creating a column that contains all possible single-byte character values and filtering the values with the digit-matching predicate.

The following statement creates a temporary table with 256 rows, one for each code point in the current code page:

WITH P0(_) AS (SELECT 0 UNION ALL SELECT 0),
P1(_) AS (SELECT 0 FROM P0 AS L CROSS JOIN P0 AS R),
P2(_) AS (SELECT 0 FROM P1 AS L CROSS JOIN P1 AS R),
P3(_) AS (SELECT 0 FROM P2 AS L CROSS JOIN P2 AS R),
Tally(Number) AS (
  SELECT -1 + ROW_NUMBER() OVER (ORDER BY (SELECT 0))
  FROM P3
)
SELECT Number AS CodePoint, CHAR(Number) AS Symbol
INTO #CodePage
FROM Tally
WHERE Number >= 0 AND Number <= 255;

Each row contains the integer value of the code point, and the character value of the code point. Not all of the character values are displayable – some of the code points are strictly control characters. Here is a selective sample of the output of SELECT CodePoint, Symbol FROM #CodePage:

0   
1   
2   
...
32   
33  !
34  "
35  #
...
48  0
49  1
50  2
...
65  A
66  B
67  C
...
253 ý
254 þ
255 ÿ

I would expect to be able to filter on the Symbol column to find digit characters using a LIKE predicate and specifying the range of characters '0' thru '9':

SELECT CodePoint, Symbol
FROM #CodePage
WHERE Symbol LIKE '[0-9]';

It produces a surprising output:

CodePoint   Symbol
48  0
49  1
50  2
51  3
52  4
53  5
54  6
55  7
56  8
57  9
178 ²
179 ³
185 ¹
188 ¼
189 ½
190 ¾

The set of code points 48 thru 57 are the ones I expect. What surprises me is that the symbols for superscripts and fractions are also included in the result set!

There might be a mathematical reason to think of exponents and fractions as numbers, but it seems wrong to call them digits.

Using binary collation as a workaround

I understand that to get the result I expect, I can force the corresponding binary collation Latin1_General_BIN:

SELECT CodePoint, Symbol
FROM #CodePage
WHERE Symbol LIKE '[0-9]' COLLATE Latin1_General_BIN;

The result set includes only the code points 48 thru 57:

CodePoint   Symbol
48  0
49  1
50  2
51  3
52  4
53  5
54  6
55  7
56  8
57  9

Best Answer

[0-9] is not some type of regular expression defined to just match digits.

Any range in a LIKE pattern matches characters between the start and end character according to collation sort order.

SELECT CodePoint,
       Symbol,
       RANK() OVER (ORDER BY Symbol COLLATE Latin1_General_CI_AS) AS Rnk
FROM   #CodePage
WHERE  Symbol LIKE '[0-9]' COLLATE Latin1_General_CI_AS
ORDER  BY Symbol COLLATE Latin1_General_CI_AS 

Returns

CodePoint            Symbol Rnk
-------------------- ------ --------------------
48                   0      1
188                  ¼      2
189                  ½      3
190                  ¾      4
185                  ¹      5
49                   1      5
50                   2      7
178                  ²      7
179                  ³      9
51                   3      9
52                   4      11
53                   5      12
54                   6      13
55                   7      14
56                   8      15
57                   9      16

So you get these results because under your default collation these characters sort after 0 but before 9.

It looks as though the collation is defined to actually sort them in mathematical order with the fractions in the correct order between 0 and 1.

You could also use a set rather than a range. To avoid 2 matching ² you would need a CS collation

SELECT CodePoint, Symbol
FROM #CodePage
WHERE Symbol LIKE '[0123456789]' COLLATE Latin1_General_CS_AS