Sql-server – Find first non-matching character between two strings

sql servert-sql

Posting this with solution provided.

The use case could be framed in a number of ways:

Return all characters that match at the start of two strings, until they do not match
How many characters match between two strings before one character doesn't match?
What is the first character that doesn't match between two strings?
etc.

For example:

If I have the strings 'Interesting' and 'Interested' they are similar for all the characters 'Interest' and then the first string ends in 'ing' while the second ends in 'ed'. Therefore, they have 8 characters in common, the first non-matching character is the 9th and the identical string is 'Interest'.

For the question in the title, the first non-matching character is number 9.

Best Answer

First you should have a Numbers table.

create table dbo.Number(N int not null);
go

alter table dbo.Number add constraint PK_Number_N primary key clustered (N);

go

;with C(N) as
(
  select 0
  union all
  select N+1
  from C
  where N < 11000
)
insert into dbo.Number(N)
select N
from C
option (maxrecursion 0);

Use the numbers table to split your strings into rows. Get the matching characters and position using intersect. Use row_number to get only the matching characters from the start and finally rebuild the string with a for xml trick or string_agg.

declare @Str1 varchar(20) = 'Interesting0a';
declare @Str2 varchar(20) = 'Interested00a';

with C1 as
(
  select N.N,
         substring(@Str1, N.N, 1) as C
  from dbo.Number as N
  where N.N between 1 and len(@Str1)
  intersect
  select N.N,
         substring(@Str2, N.N, 1)
  from dbo.Number as N
  where N.N between 1 and len(@Str2)
),
C2 as
(
  select C1.N,
         C1.C,
         row_number() over(order by C1.N) as rn
  from C1
)
select string_agg(C2.C, '') within group(order by C2.N)
from C2
where C2.N = C2.rn;

/*
select (
       select C2.C as '*'
       from C2
       where C2.N = C2.rn
       order by C2.N
       for xml path(''), type
       ).value('text()[1]', 'varchar(20)');
*/

Related Solutions

SQL Server – Why Non-Digits Match LIKE [0-9]

[0-9] is not some type of regular expression defined to just match digits.

Any range in a LIKE pattern matches characters between the start and end character according to collation sort order.

SELECT CodePoint,
       Symbol,
       RANK() OVER (ORDER BY Symbol COLLATE Latin1_General_CI_AS) AS Rnk
FROM   #CodePage
WHERE  Symbol LIKE '[0-9]' COLLATE Latin1_General_CI_AS
ORDER  BY Symbol COLLATE Latin1_General_CI_AS

Returns

CodePoint            Symbol Rnk
-------------------- ------ --------------------
48                   0      1
188                  ¼      2
189                  ½      3
190                  ¾      4
185                  ¹      5
49                   1      5
50                   2      7
178                  ²      7
179                  ³      9
51                   3      9
52                   4      11
53                   5      12
54                   6      13
55                   7      14
56                   8      15
57                   9      16

So you get these results because under your default collation these characters sort after 0 but before 9.

It looks as though the collation is defined to actually sort them in mathematical order with the fractions in the correct order between 0 and 1.

You could also use a set rather than a range. To avoid 2 matching ² you would need a CS collation

SELECT CodePoint, Symbol
FROM #CodePage
WHERE Symbol LIKE '[0123456789]' COLLATE Latin1_General_CS_AS

T-SQL Equivalent for Punctuation Similar to [0-9] for Numbers and [a-z] for Letters

The biggest difficulty in coming to a precise solution is in defining exactly what characters are to be included (or excluded, whichever direction makes more sense for the operation). Meaning:

Are we talking about VARCHAR / ASCII data or NVARCHAR / Unicode data? The list of punctuation characters for ASCII data depends on the Code Page which in turn depends on the Collation. (in this Question we are dealing with ASCII data).
Are we dealing with case-sensitive or case-insensitive searches?
What Collation is the column set to? The Collation will tell us both the Code Page and case-sensitivity. (in this question we are dealing with Latin1_General_CI_AS)
is the term "punctuation" to mean just standard punctuation characters (e.g. ., ,, ;, :, etc) or does it mean non-alphanumeric characters?
Are whitespace characters included?
Are Control characters included?
What about currency symbols such as ¢, £, ¥, etc?
What about symbols such as © and ™?
What characters are considered "alpha"? Are non-English characters such as Â, É, Ñ, ß, Þ included?
Since this Question deals with UK keyboards (see discussion for this Question), what about the Æ / æ character?

In order to help facilitate clarity regarding expected behavior, the following query will show all 256 characters of the Latin1 character set (i.e. Code Page 1252) and how two variations of @Shaneis's proposed solution operate. The first field (labeled as Latin1_General_CI_AS) shows the LIKE clause as proposed by @Shaneis (as of this writing) and the second field (labeled as Latin1_General_100_BIN2) shows a modification where I overrode the Collation to specify a binary one (i.e. a Collation ending in _BIN2; the _BIN Collations are deprecated so don't use them if you have access to the _BIN2 versions) which meant I also needed to add in the A-Z range to filter out upper-case letters as the current Collation is case-insensitive:

;WITH nums AS
(
  SELECT TOP (256) (ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1) AS [Decimal]
  FROM   [master].[sys].[all_objects]
)
SELECT nm.[Decimal],
       CHAR(nm.[Decimal]) AS [Character],
       CASE WHEN CHAR(nm.[Decimal]) LIKE '%[^a-z0-9]%'
               THEN 'x' ELSE '' END AS [Latin1_General_CI_AS],
       CASE WHEN CHAR(nm.[Decimal]) LIKE '%[^a-z0-9A-Z]%' COLLATE Latin1_General_100_BIN2
               THEN 'x' ELSE '' END AS [Latin1_General_100_BIN2]
FROM   nums nm;

UPDATE

It should be mentioned that IF one is truly seeking to find characters that are classified as being "punctuation" (and not "currency symbol", "mathematical symbol", etc), and IF one is not prohibited from using SQLCLR / loading a custom Assembly (SQLCLR was introduced with SQL Server 2005, and I have yet to come across a good reason for not allowing it, especially since Azure SQL Database V12 supports SAFE Assemblies), then you can use Regular Expressions, but not for the reason that most people would guess.

Rather than using Regular Expressions to build a more functional character range, or even rather than using something like \w (meaning any "word" character), you can specify the Unicode Category of the characters you want to filter on, and there are several defined categories:

https://www.regular-expressions.info/unicode.html#category

You can even specify the Unicode Block to filter on, such as "InBengali" or "InDingbats" or "InOptical_Character_Recognition", etc:

https://www.regular-expressions.info/unicode.html#block

There are numerous examples of creating RegEx functions for SQL Server (though most examples don't follow SQLCLR best practices), or you can download the Free version of the SQL# library (which I created), and use the scalar RegEx_IsMatch function as follows:

SQL#.RegEx_IsMatch(Unicode-String-Expression, N'\p{P}', 1, NULL)

The \p{P} expression means \p = Unicode Category, and {P} = all punctuation (as opposed to a specific type of punctuation, such as "Connector Punctuation"). AND, the "Punctuation" category includes all punctuation across all languages! You can see the full list on the Unicode.org site via the following link (there are currently 717 Code Points in that category):

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AGeneral_Category%3DPunctuation%3A%5D

An updated version of the test query shown above, including another field that uses SQL#.RegEx_IsMatch with \p{P}, and the results of all 3 tests across all 256 characters of Code Page 1252 (i.e. Latin1_General) has been posted on PasteBin.com at:

T-SQL query and results for filtering types of characters

UPDATE
The following was mentioned in the related discussion:

You've made a good point about accented characters, with them being hotel names from around the world there will be accented characters in the names, for my problem I would want to class these as valid alpha characters.

In this case:

There are 11 non-English characters that are included in the Latin1 character set / Code Page that are not matched by the a-z range. They are: ð Ð Þ þ œ Œ š Š ž Ž Ÿ. These need to be added to the wildcard, and while not necessary at the moment, it wouldn't hurt to add in A-Z so that the pattern works just as well on a case-sensitive collation. The end result is:
LIKE '%[^a-zA-Z0-9ðÐÞþœŒšŠžŽŸ]%'
Considering that this data can include "hotel names from around the world", I would highly recommend changing the datatype of the column to be NVARCHAR so that you can store all characters from all languages. Keeping this as VARCHAR runs a very high risk of eventually having data loss since you can only represent the Latin-based languages, and not even fully for those given the six supplemental Unicode categories that provide additional Latin-related characters.

Best Answer

Related Solutions

SQL Server – Why Non-Digits Match LIKE [0-9]

T-SQL Equivalent for Punctuation Similar to [0-9] for Numbers and [a-z] for Letters

Related Question