Sql-server – charindex thrown off by extended characters

character-setencodingsql serversql-server-2016varchar

I've got a column with a filenames stored in a nvarchar(255). I'm trying to parse out the file extension using

reverse( left( reverse(filename), charindex('.', reverse(filename) ) -1 ) )

This works great for most of my records, however some are failing due to charindex returning 0 while a '.' is definitely part of the string.

I think I've isolated it to an issue with accented characters and a few others.

One of them I have is ité.pdf, when reversed, charindex is not finding the '.'

fdp.́eti

I've taken to converting it to it's hex representation to narrow down what's going on using

cast(reverse(filename) as varbinary(max))

0x6600640070002E000103650074006900

From what I can work out The '.' is 2E 00. 01 03 65 00 is the 'é' which is encoded differently than the others I have in the same column.

Since 2E 00 comes first I would still expect charindex to get a match, but it's returning me 0, which in turn make the call to LEFT fail.

Am I missing something obvious here? Converted to HEX it seems the '.' I am looking for is definitely there, but the character after is tripping charindex somehow.

EDIT: Simplified test

When outputting the string there is clearly a ., but charindex is not finding it?

select charindex('.', cast(0x6600640070002E000103650074006900 as nvarchar(max)))

Fiddle:
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=cda6f0a77fa9e206bff9217ec58be075

Best Answer

Apparently it seems related to a collation or encoding issue.

I'd check database and column collations to ensure both are the same.

Server collation

SELECT CONVERT (varchar, SERVERPROPERTY('collation')) AS 'Server Collation';

Database collation

SELECT 
    name, 
    collation_name 
FROM 
    sys.databases
WHERE 
    name = 'DBName';

Column collation

SELECT 
    name, 
    collation_name 
FROM 
    sys.columns 
WHERE 
    name = N'ColumnName';

Quoted from MS-Docs Collation precedende

Functions and Collation

THE CAST, CONVERT, and COLLATE functions are collation sensitive for char, varchar, and text data types. If the input and output of the CAST and CONVERT functions are character strings, the output string has the collation label of the input string. If the input is not a character string, the output string is Coercible-default and assigned the collation of the current database for the connection, or the database that contains the user-defined function, stored procedure, or trigger in which the CAST or CONVERT is referenced.

For the built-in functions that return a string but do not take a string input, the result string is Coercible-default and is assigned either the collation of the current database, or the collation of the database that contains the user-defined function, stored procedure, or trigger in which the function is referenced.

The following functions are collation-sensitive and their output strings have the collation label of the input string:

CHARINDEX, REPLACE, DIFFERENCE, REVERSE, ISNUMERIC, RIGHT, LEFT, SOUNDEX, LEN, STUFF, LOWER, SUBSTRING, PATINDEX, UPPER

Let me suggest to have a look at this or this answers of Solomon Rutzky about UTF-8 and UTF-16 encoding values for VARCHAR and NVARCHAR.

By now you can try by CASTING it to VARCHAR(255) before use CHARINDEX.

I've tried to reproduce it in one of my servers in this way:

DECLARE @t VARBINARY(MAX) = 0x6600640070002E000103650074006900;
DECLARE @vt VARCHAR(MAX);

SELECT CAST(@t AS nvarchar(MAX));

---------
fdp.́eti


SET @vt =  CAST(CAST(@t AS nvarchar(MAX)) AS VARCHAR(MAX))

SELECT REVERSE(LEFT(REVERSE(@vt), CHARINDEX(N'.', REVERSE(@vt))-1));

-------
´eti

Note it returns the accent as a single character.