Sql-server – How to find values with multiple consecutive upper case characters

collationsql serversql server 2014t-sqlunicode

For a project that I am working on, I need to identify values that have incorrect uppercase.

For example, I need it to identify the following types of values:

Mr JOHN Smith
MR John Smith
Mr John SMITH
Mr JOhn Smith

Ideas I have tried so far:

Select * from table where Name = upper(Name) 
  collate SQL_Latin1_General_CP1_CS_AS

(which gives results for rows that are all upper case, such as MR JOHN SMITH)

and

Select * from table where right(Name,3) = upper(right(Name,3)) 
  collate SQL_Latin1_General_CP1_CS_AS

which picks up some rows too.

But I doubt this is the most efficient way to go about this.

Best Answer

This question is far more complicated than it appears to be on the surface (hence the longer-than-most-would-expect answer). If the strings being searched were codes (postal codes, ISO country codes, ISO state codes, SKUs, etc) or something where the characters used were a limited subset of all possible letters of all languages, then this would be fairly straight-forward. But when dealing with people's names, then no such luck.

                        All of the following example code and test cases can be found on Pastebin ( Searching
for Case-Sensitive patterns in SQL Server
). The SQL posted on Pastebin includes
additional test cases and additional examples related to various points mentioned below.

Very simplistically you can do the following, which does work with the sample data. I added two test cases to help determine whether or not this (or any) method works as it is data that should not match.

SET NOCOUNT ON;
DECLARE @SampleData TABLE
(
  Name NVARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
);
INSERT INTO @SampleData (Name) VALUES (N'Mr JOHN Smith');
INSERT INTO @SampleData (Name) VALUES (N'MR John Smith');
INSERT INTO @SampleData (Name) VALUES (N'Mr John SMITH');
INSERT INTO @SampleData (Name) VALUES (N'Mr JOhn Smith');
INSERT INTO @SampleData (Name) VALUES (N'mr john smith'); -- all lower-case
INSERT INTO @SampleData (Name) VALUES (N'Mr John Smith'); -- proper-case

SELECT *
FROM @SampleData
WHERE Name COLLATE Latin1_General_100_CS_AS
                 LIKE N'%[ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]%';

Please note:

  1. I am not using a collation that starts with SQL_ as those have been deprecated (or perhaps more accurately: obsoleted) since SQL Server 2000 was released.

  2. I am using a collation in the 100 series as that is the most recent, and should be available starting in SQL Server 2008. Not all collations have a 100 series, but it is best to use the most recent one, which might be 90.

  3. I am not using a binary collation (i.e. ending in _BIN2 or even _BIN though the _BIN collations have been deprecated since SQL Server 2005 was released, so only use _BIN2 if needing a binary collation). Binary collations are not "case sensitive"; they are "byte sensitive" (_BIN) / "code point sensitive" (_BIN2) and those are very different concepts. Binary collations only appear to be case sensitive when working with (or testing) US-English-only characters. Any characters with accents can potentially be handled incorrectly when using binary collations. For example:

    PRINT N'U' + NCHAR(0x0308) +
       N' <-- U + NCHAR(0x0308) -- combining diaeresis [1 "character" from 2 code points / 4 bytes]';
       -- Ü
    PRINT NCHAR(220) + N' <-- NCHAR(220) [1 "character" from 1 code point / 2 bytes]'; -- Ü
    PRINT NCHAR(252) + N' <-- NCHAR(252) [1 "character" from 1 code point / 2 bytes]'; -- ü
    PRINT '';
    
    IF (NCHAR(252) = N'U'+NCHAR(0x0308) COLLATE Latin1_General_100_CI_AS)
            PRINT 'Same!' ELSE PRINT 'Nope.'; -- case INsensitive
    IF (NCHAR(252) = N'U'+NCHAR(0x0308) COLLATE Latin1_General_100_CS_AS)
            PRINT 'Same!' ELSE PRINT 'Nope.'; -- case SENSITIVE
    IF (NCHAR(220) = N'U'+NCHAR(0x0308) COLLATE Latin1_General_100_CS_AS)
            PRINT 'Same!' ELSE PRINT 'Nope.'; -- case SENSITIVE
    IF (NCHAR(220) = N'U'+NCHAR(0x0308) COLLATE Latin1_General_100_BIN2)
            PRINT 'Same!' ELSE PRINT 'Nope.'; -- Binary
    

    For a more in-depth look into this topic, please see my blog post: No, Binary Collations are not Case-Sensitive.

  4. Instead of using a single character range (i.e. the [A-Z] that works in LIKE and PATINDEX) I added each of the 26 letters. This is required in order for the preferred _CS (i.e. case sensitive) collations to work properly. If you specify a range of characters via the [A-Z] syntax, then it will appear to not respect the case sensitive collation. That, however, is a false interpretation of the behavior. Case sensitive comparisons can be tricky since they, by design, work differently between equality comparisons (e.g. WHERE N'zbz' LIKE N'%B%' or WHERE N'b' = N'B') and range comparisons (e.g. WHERE N'zbz' LIKE N'%[A-C]%'). Range comparisons are more like sorting. And, case sensitive sorting does not handle all of one case before the other case. Think in terms of how dictionaries are ordered (and in fact, the Microsoft documentation even uses the term "dictionary ordering" to describe non-binary collations): they don't place all words starting with A-Z prior to words starting with a-z. They group upper and lower case of each letter together: A,a,B,b,C,c, and so on, not A,B,C,a,b,c. Hence, assuming that upper-case are first (and which one comes first depends on if you are using a SQL_ or non-SQL_ collation), a range of [A-C] equates to [AaBbC], and if lower-case are first, that same range equates to [AbBcC]. In both cases, the lower-case "b" is validly in that range.

However, this approach mainly works if the data you are searching for is guaranteed to only ever have the 26 US-English characters in it, and never ever any other characters found in most other languages, especially those with accents / diacritical marks. Given that your data contains names, you cannot guarantee that, even if working in the US since people live here who came from somewhere else (or an ancestor did), or sign up for services hosted here even though they live anywhere in the world.

For example, add the following two test cases to the example code above:

INSERT INTO @SampleData (Name) VALUES (N'Mr ÜLala Jones');
INSERT INTO @SampleData (Name) VALUES (N'Mr üLala Jones');

Running the code again will not yield new results. The second line added uses a lower-case ü so it shouldn't match anyway. But the first line uses an upper-case Ü so it should match, but can't unless we either make the collation accent-insensitive (i.e. Latin1_General_100_CS_AI) or add the Ü character to both character ranges of the LIKE clause (e.g. [ABCDEFGHIJKLMNOPQRSTUVWXYZÜ]).

Making the comparison accent-insensitive might be an option when looking for any two capital letters next to each other, but not when comparing actual words. And adding the extra letters (the ones with accents) only works when using VARCHAR (i.e. Extended ASCII) data. If that is the case then this approach could be made to work IF you added all of the additional capital letters available in the code page specified by the Collation. To find that, you can do the following:

SELECT COLLATIONPROPERTY('Latin1_General_100_CI_AS', 'CodePage');
-- 1252

According to the Wikipedia page for Code Page 1252, there are 32 additional capitals, which are (I made it a heading so that the characters would be easier to see):

Š Ž ÀÁÂÃÄÅ Æ Ç ÈÉÊË ÌÍÎÏ Ð Ñ ÒÓÔÕÖ Ø ÙÚÛÜ Ý Þ

The result would be a LIKE clause single-character specification of:

[ABCDEFGHIJKLMNOPQRSTUVWXYZŠŽÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]

But that particular set of characters only works with Code Page 1252. If your collation uses another Code Page then that set of extra characters would most likely be different, even if there is some overlap.


On the other hand, if you are using Unicode / NVARCHAR data, then even adding those extra characters will not work. Not only can Unicode represent all accented characters (unlike any individual Code Page), it also allows for constructing accented characters by starting with a regular, non-accented character, such as "U" and then adding one or more accents to it. These accents are called "Combining Characters" since they only work in combination with certain base characters (for a more detailed look into Combining Characters, please see my answer to How To Strip Hebrew Accent Marks ). An example of this was shown in Note #3 above (e.g. PRINT N'U' + NCHAR(0x0308); -- Ü ).

If we now add the following two test cases to the example code above:

INSERT INTO @SampleData (Name) VALUES (N'Mr U' + NCHAR(0x0308) + N'Lala Jonez');
INSERT INTO @SampleData (Name) VALUES (N'Mr U' + NCHAR(0x0308) + NCHAR(0x0309) +
                                     NCHAR(0x0308) + N'Lala Jonezzz');

and re-run the test, those new entries will not show up, at least not when using the original LIKE definition that specified just the 26 letters of the US-English alphabet. That first new entry would show up if we added the Ü character since U + that combining character equate to the same thing. So, is it feasible to add the remaining characters like I showed above in the VARCHAR example? That was 32 extra characters. How many extra characters for NVARCHAR? About 1200 ;-). Um no, not feasible.

And if you did add in all 1200 additional capital letters, would that work? Nope, because there can still be more capitals that simply don't have a "pre-composed" equivalent. For example, that second new entry won't show up since there is no equivalent character to put into the LIKE definition. The only way to make this work would be to not only add in all 1200 additional capitals, but also use an accent-insensitive collation. This still doesn't seem very feasible. And it still would only work in limited scenarios such as this one (looking for any capitals, not specific ones).

Unfortunately, T-SQL is just not equipped to deal with this.

But fortunately, there is a way of accomplishing this: Regular Expressions (commonly referred to as just RegEx). RegEx allows for specifying "classes" of characters (e.g. \s = "any white-space character"), and one class -- \p{property} -- will match characters based on certain properties of Unicode characters. The following two class specifications will allow us to get any capital letter, even if it doesn't exist today:

  • \p{Lu} = An upper-case letter that has a lower-case equivalent
  • \p{M} = A combining mark

We can use those to come up with the following RegEx pattern:

(?:\p{Lu}\p{M}*){2}

To translate this for those who aren't familiar with RegEx syntax:

  • The ( and ) group everything inside to be a single pattern that can be repeated or referenced later
  • The ?: tells the RegEx engine to not save any matches for that grouping. We won't be using whatever matches so it is more efficient to not store it in memory.
  • The * indicates that there should be 0 or more matches for the preceding item (i.e. the \p{M}). This will match any number of diacritical marks, or none.
  • \p{Lu}\p{M}* altogether says to match any single upper-case letter by itself, or that is followed by any number of diacritical marks. This will match, for example: U, Ü, U + NCHAR(0x0308), and so on.
  • The {2} indicates that a match will consist of exactly 2 occurrences of the preceding item (i.e. the (?:\p{Lu}\p{M}*)). Without this part it would only match a single upper-case letter, not 2. And this is why we needed the parenthesis: to group the upper-case letter and the zero-or-more combining marks together to be a sub-pattern that could be repeated.

The only part left to take care of is the fact that SQL Server does not have any built-in support for Regular Expressions. For this we turn to SQLCLR which allows us to use the RegEx functionality that exists in .NET. If you do not already have SQLCLR RegEx functions, there are many code examples around the Interwebs (just be careful since a few are pretty good, but many are pretty bad), or you can download the Free version of SQL#, a pre-compiled library of SQLCLR functions (that I created) which contains several RegEx functions, including the one used in the following example code:

SELECT *
FROM @SampleData
WHERE SQL#.RegEx_IsMatch4k(Name, N'(?:\p{Lu}\p{M}*){2}', 1, '') = 1;

Running against the test data again will bring back all of the expected entries, including the final one with the 3 diacritical marks that isn't even a character in any language (at least, not yet). Using RegEx with this pattern will just work, always, for all known languages (well, languages that have capitals since some do not).