SQL Server – Converting columns from NVARCHAR to VARCHAR

datatypes, sql-server, sql-server-2016, type-conversion, varchar

I have several large tables with many NVARCHAR(MAX) columns. This database is only for US use and I don't think there are any foreign characters or letters in any of the columns, so I wanted to look into:

  1. Changing many of these columns over from NVARCHAR to VARCHAR, and also
  2. Resizing these columns to more appropriate sizes based on some data profiling I have done (a sketch of that kind of profiling query follows this list). I know, for example, that the max length of the data in some of these columns is only around 30 characters, so NVARCHAR(MAX) seems like overkill.
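
For reference, here is a minimal sketch of the kind of profiling query I used (the table and column names are made up; LEN counts characters while DATALENGTH counts bytes, which for NVARCHAR is two per character):

SELECT MAX(LEN(tab.[Notes]))        AS [MaxCharacters],
       MAX(DATALENGTH(tab.[Notes])) AS [MaxBytes]
FROM   dbo.[SomeTable] tab;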

I am also trying to limit implicit conversions.

What is an easy way to determine, before making these changes, whether each column really does contain only data that will fit in VARCHAR and won't have any issues being converted away from NVARCHAR?

Best Answer

This database is only for US use and I don't think there are any foreign characters or letters in any of the columns so I wanted to look into ... Changing many of these columns over from NVARCHAR to VARCHAR

Before you go any further you need to research / determine the actual business requirements of these columns. Even if one or more of these columns currently holds only characters that would fit just as easily into VARCHAR, that doesn't mean that someone won't try to enter Unicode-only characters tomorrow, or that doing so would be outside the intended design of the system. If end users / sales people / support people / the documentation believe that Unicode-only characters can be entered into the system, then the system needs to allow for that, even if nobody has made use of that functionality yet. If that is the case and you feel it should be changed because you don't think there is a need to enter such characters, then you need to bring it up with a manager / Product Owner / etc.

Also, just because the intended user base is in the US doesn't mean that Unicode-only characters won't be used. At the very least, in "name" columns, "description" columns, etc., there could easily be a variety of accented characters, or who knows what. The discussion that you have with the Product Owner and/or team should revolve around what is required for each of these columns / data points.
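
As a quick illustration (the collation here is just an example; the Latin1_General collations use code page 1252): a common accented character converts cleanly, while a character with no mapping in that code page turns into ?:

SELECT CONVERT(VARCHAR(10), N'José' COLLATE Latin1_General_100_CI_AS);
-- José   (é has a direct mapping in code page 1252)
SELECT CONVERT(VARCHAR(10), N'中' COLLATE Latin1_General_100_CI_AS);
-- ?      (no mapping in code page 1252)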

I know, for example, that the max length of the data in some of these columns is only around 30 characters, so NVARCHAR(MAX) seems like overkill.

Similar to the points made above regarding what types of characters should be allowed in these columns, you first need to determine how many characters the system allows people to send in. If a description field is capped at 500 - 1000 characters in the UI, and other parts of the code, and even the documentation (one can always hope, right?) agree, then yes, NVARCHAR(MAX) is absolutely overkill. BUT, if the column needs to store more than 4000 characters of a variety that is not representable in an 8-bit code page, then NVARCHAR(MAX) is not overkill (though you might need to consider a better design than storing too many of them in a single table). Either way, the fact that there are only 30 characters max in a particular NVARCHAR(MAX) column is good information to provide to a Product Owner / team so that a more informed decision can be made regarding the fate of that column. Perhaps a field that currently allows for 1000 characters but has a max record length of 30 - 50 should be updated to only allow 75 - 100. But that needs to be a coordinated effort.
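
If a resize is agreed upon, the change itself is a simple ALTER (the table, column, and size here are hypothetical; confirm first that no data would be lost, using the check shown at the end of this answer, and be aware that rewriting a large column can be an expensive operation):

ALTER TABLE dbo.[SomeTable]
    ALTER COLUMN [Notes] VARCHAR(100) NULL; -- was NVARCHAR(MAX); restate nullability explicitly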

All of that said, if you want to determine whether there are any characters in an NVARCHAR / NCHAR / NTEXT column that cannot be converted to VARCHAR, you need to convert to VARCHAR using the _BIN2 variation of the collation being used for that particular column. For example, if a particular column is using Albanian_100_CI_AS, then you would specify Albanian_100_BIN2 for the test.

The reason for using a _BIN2 collation is that non-binary collations will only find instances where there is at least one character that has no mapping at all in the code page and is thus converted into ?. They do not catch instances where a character has no direct mapping into the code page but does have a "best fit" mapping. For example, the superscript 2 character, ², has a direct mapping in code page 1252, so definitely no problem there. On the other hand, it has no direct mapping in code page 1250 (used by the Albanian collations), but it does have a "best fit" mapping that converts it into a regular 2. The problem with the non-binary collation is that 2 will equate to ², and so it won't register as a row that can't convert to VARCHAR. For example:

SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE French_100_CI_AS); -- Code Page 1252
-- ²
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS); -- Code Page 1250
-- 2

SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS)
WHERE  N'²' <> CONVERT(NVARCHAR(MAX),
                       CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS));
-- (no rows returned)

SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_BIN2)
WHERE  N'²' <> CONVERT(NVARCHAR(MAX),
                       CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_BIN2));
-- 2

Ideally you would convert back to NVARCHAR explicitly so that the code is clear about what it's doing; skipping that step still results in an implicit conversion back to NVARCHAR, so the behavior is the same either way.
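
Putting that all together for a real table: first find the actual collation of each NVARCHAR column so you know which _BIN2 variation to test with (note that sys.columns reports max_length in bytes, with -1 meaning MAX):

SELECT SCHEMA_NAME(tbl.[schema_id]) AS [SchemaName],
       tbl.[name] AS [TableName],
       col.[name] AS [ColumnName],
       col.max_length,    -- bytes, not characters; -1 = MAX
       col.collation_name
FROM   sys.columns col
INNER JOIN sys.tables tbl
        ON tbl.[object_id] = col.[object_id]
WHERE  TYPE_NAME(col.user_type_id) = N'nvarchar';

Then run the round-trip comparison against each column. The table, key, and column names below are placeholders, and the column is assumed to use Albanian_100_CI_AS (hence the Albanian_100_BIN2 test collation); any rows returned contain at least one character that would not survive the conversion:

SELECT tab.[ID], tab.[Notes]
FROM   dbo.[SomeTable] tab
WHERE  tab.[Notes] <> CONVERT(NVARCHAR(MAX),
                          CONVERT(VARCHAR(MAX),
                                  tab.[Notes] COLLATE Albanian_100_BIN2));

If that returns no rows, the column's current contents will convert to VARCHAR without loss.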