Sql-server – Saving into a VARCHAR(MAX) throwing error in some scenarios

encodingnative-clientsql serverwindows

We are noticing an error like this:

InterfaceError: ('HY104', 0, '[Microsoft][SQL Server Native Client 11.0]
Invalid precision value', 10226)

It happens when we insert a certain length of characters along with Unicode content like \u3000 (Ideographic Space). This happens only we are using the Windows MSSQL Driver and the query is parametrized.

The column is a VARCHAR(MAX). So it should support 2GB.

What could be the reason for this?
Why is it that length matters?

If I just insert this Unicode character alone, it won't cause an issue!

I was trying to see this from a theoretical point of view. Switching to NVARCHAR is not an option for us currently, but removing Unicode is.

Here are some more details as per the questions asked by Solomon Rutzky:

Column Collation is SQL_Latin1_General_CP1_CI_AS
String length where I start seeing the error is 4019.
sqlalchemy.connectors.mxodbc, line 86, in error_handler is where I get the initial error. This is from the traceback.
It works fine with single Unicode character.
A 20000 character string without Unicode works fine.
It makes it ASCII from when I try this from this software (Web App). I tested in Dbeaver by exporting the data.

Best Answer

It would greatly help to know:

the current Collation of the VARCHAR(MAX) column
the exact string length (in bytes) at which you start seeing the error
What software is reporting this error?
Does the error occur if only a single Unicode character is present?
Does a 10,000 character string with no Unicode characters get inserted correctly?
For a short-enough string that gets inserted without error yet contains one of these Unicode characters, what value exactly is in the table? You can check by converting the column to VARBINARY(MAX). Did the Unicode character get translated or something else?

It is possible that the string that errors is just over a 4000 or 8000 character limit that then requires using a MAX type and the driver is initially guessing (or being told) that it is one size and then finding out that it is another. The error message says "invalid precision value" which has to be referring to how the column is being configured (i.e. similar to setting up a SqlParameter in .NET and declaring the max size, such as new SqlParameter("@name", SqlDbType.VarChar, 8000)). So this type of error should only really happen if the software is attempting to set up the column as VARCHAR of over 8000 or NVARCHAR of over 4000, as either of those conditions would be an invalid "precision". Now, in order to get such a number I suspect that some code in there is counting the number of bytes (i.e. DATALENGTH ) of the string, which will be larger than the number of characters. Of course, this would be true of any character in a .NET string or C++ wchar, so I am suspecting that the difference with having some Unicode characters vs not having any is that without any Unicode characters it can convert the string to an 8-bit encoding (i.e. single byte for use with VARCHAR), but somehow having a Unicode character prevents that. It's a long-shot (until there is more info provided), but it is pretty clear where the error is occurring.

If I just insert the character alone it won't cause an issue.

How are you inserting the character? Through the same software that is generating the error, or in SSMS? If you are doing this manually via an INSERT statement, then that isn't a good test as SQL Server converts U+3000 into a regular space, U+0020 (one byte), instead of two questions marks (still 2 bytes, as would happen if there was no equivalent character):

SELECT N'-' + NCHAR(0x3000) + N'-',
       CONVERT(VARCHAR(MAX), N'-' + NCHAR(0x3000) + N'-'),
       CONVERT(VARBINARY(MAX), CONVERT(VARCHAR(MAX), N'-' + NCHAR(0x3000) + N'-'));

returns:

-　-    - -    0x2D202D

If possible, trying passing in both of the strings generated by the following code via the software, not via SSMS:

DECLARE @String4k NVARCHAR(MAX);
SET @String4k = REPLICATE(N'a', 3999) + NCHAR(0x3000);
SELECT LEN(@String4k);

DECLARE @String8k NVARCHAR(MAX);
SET @String8k = REPLICATE(CONVERT(NVARCHAR(MAX), N'a'), 7999);
SET @String8k += NCHAR(0x3000);
SELECT LEN(@String8k);

The @String4k string is 3999 characters that can convert cleanly into an 8-bit encoding (i.e. VARCHAR) plus the U+3000 character that will likely remain as 2 bytes. So maybe this comes across, through that software, as 4001 characters. I doubt that this is the issue, but can't hurt to test.

The @String8k string is 7999 characters that can convert cleanly into an 8-bit encoding (i.e. VARCHAR) plus the U+3000 character that will likely remain as 2 bytes. So maybe this comes across, through that software, as 8001 characters.

Related Solutions

PostgreSQL encoding conversion error

know that I should migrate my database to utf8 to solve this problem, but for some reasons, I can not do that for the moment.

In my case, I'd rather PostgreSQL saves my string removing characters it can not convert or for example replacing them with some symbol like "?" rather than throwing an error...

PostgreSQL does not support this. It's requested periodically, but nobody who requests it does the work to actually implement it in the system and convince the dev team it's an appropriate option to offer.

You will need to do your text-mangling client-side. In PHP, before you send the text to PostgreSQL, you will need to filter out characters that doesn't match the database encoding. How to do that is entirely PHP-specific (start with iconv support, probably). You have described one way to do this, using utf8_decode, already.

Using utf8_decode is actually incorrect, because the function (per the docs) actually assumes the input is ISO-8859-1, i.e. Latin-1. You're using latin-9, i.e. ISO-8859-15. So it'll mangle some of your input characters, in particular the Euro sign. See changes from ISO-8859-1. Instead, use the iconv function. See the surprisingly useful comments on the utf8_decode function documentation.

If in the process of filtering the text you convert it to LATIN9 inside PHP, remember that you must set your client_encoding to latin9, since that's the encoding of the text you'll be sending to PostgreSQL. That means the results will be in latin-9 too, so you must convert all results from PostgreSQL from latin-9 back to PHP's native utf-8.

If you use utf8_encode to convert your latin-9 output from PostgreSQL for consumption in PHP, you'll have the same problem with latin-1 vs latin-9 as you do on utf8_decode.

For that reason, if possible, try to use a filter that replaces characters not supported in latin-9 without actually converting the string to latin-9. It'll save you a bunch of hassle if you can keep client_encoding set to utf-8 and just mangle your strings instead of converting them.

All this said, I strongly recommend upgrading the database to utf-8 instead. The only reason to keep it in latin-9 would be if you have other client applications that can't cope with chars outside the latin-9 range (i.e. they rely on a latin-9 client_encoding)

Sql-server – Why these characters are all equal in SQL Server

All character data in SQL Server is associated with a collation, which determines the domain of characters that can be stored as well as the rules used to compare and sort data. Collation applies to both Unicode and Non-Unicode data.

SQL Server includes 3 broad categories of collations: binary, legacy, and Windows. Collations in the binary category (_BIN suffix) use the underlying code points to compare so equality comparisons return not-equal if the code points differ regardless of the character. Legacy (SQL_ prefix) and Windows collations provide sorting and comparison semantics for the more natural dictionary rules. This allows comparisons to consider case, accents, width, and Kana. Windows collations provide more robust word-sort rules that closely align with the Windows OS whereas legacy collations consider only single characters.

The example below illustrates the differences between Windows and binary collation with the Teth character:

CREATE TABLE dbo.WindowsColationExample
    (
      Character1 nchar(1) COLLATE Arabic_100_CI_AS_SC
    , Character2 nchar(1) COLLATE Arabic_100_CI_AS_SC
    , Character3 nchar(1) COLLATE Arabic_100_CI_AS_SC
    , Character4 nchar(1) COLLATE Arabic_100_CI_AS_SC
    );

CREATE TABLE dbo.BinaryColationExample
    (
      Character1 nchar(1) COLLATE Arabic_100_BIN
    , Character2 nchar(1) COLLATE Arabic_100_BIN
    , Character3 nchar(1) COLLATE Arabic_100_BIN
    , Character4 nchar(1) COLLATE Arabic_100_BIN
    );

INSERT  INTO dbo.BinaryColationExample
VALUES  ( NCHAR(65217), NCHAR(65218), NCHAR(65219), NCHAR(65220) );
INSERT  INTO dbo.WindowsColationExample
VALUES  ( NCHAR(65217), NCHAR(65218), NCHAR(65219), NCHAR(65220) );

--all characters compare not equal
SELECT *
FROM dbo.BinaryColationExample
WHERE
    character1 = character2
    OR character1 = character3
    OR character1 = character4
    OR character2 = character3
    OR character2 = character4
    OR character3 = character4;

--all characters compare equal
SELECT *
FROM dbo.WindowsColationExample
WHERE character1 = character2;
SELECT *
FROM dbo.WindowsColationExample
WHERE character1 = character3;
SELECT *
FROM dbo.WindowsColationExample
WHERE character1 = character4;
SELECT *
FROM dbo.WindowsColationExample
WHERE character2 = character3;
SELECT *
FROM dbo.WindowsColationExample
WHERE character2 = character4;
SELECT *
FROM dbo.WindowsColationExample
WHERE character3 = character4;

Reasons as to why Unicode may contain different code points for identical glyphs are outlined in http://en.wikipedia.org/wiki/Duplicate_characters_in_Unicode. I summary, it may be for legacy compatibility or the characters are not canonically equivalent. Note that the Teth character ﻁ is used in different languages (http://en.wikipedia.org/wiki/Teth).

Best Answer

Related Solutions

PostgreSQL encoding conversion error

Sql-server – Why these characters are all equal in SQL Server

Related Question