I have several large tables with many `NVARCHAR(MAX)` columns. This database is only for US use and I don't think there are any foreign characters or letters in any of the columns, so I wanted to look into:

- Changing many of these columns over from `NVARCHAR` to `VARCHAR`, and also
- Resizing these columns to more appropriate sizes based on some data profiling I have done. I know, for example, that the max length of the data in some of these columns is only around 30 characters, so `NVARCHAR(MAX)` seems like overkill.
I am also trying to limit implicit conversions.
What is an easy way to determine, before doing this, whether each column really does contain only `VARCHAR`-compatible data and won't have any issues being converted away from `NVARCHAR`?
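(The profiling mentioned above can be done with something like the following — the table and column names here are placeholders:)

```sql
-- Hypothetical object names; MAX(LEN(...)) gives the longest stored value
-- in characters, MAX(DATALENGTH(...)) gives its size in bytes.
SELECT MAX(LEN(t.[SomeColumn]))        AS [MaxChars],
       MAX(DATALENGTH(t.[SomeColumn])) AS [MaxBytes]
FROM   dbo.[SomeTable] t;
```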
Best Answer
Before you go any further, you need to research and determine the actual business requirements for these columns. Even if it's true that one or more of these columns are not currently holding any characters that wouldn't fit just as easily into `VARCHAR`, that doesn't mean that someone won't try to enter Unicode-only characters tomorrow, or that doing so would be outside the intended design of the system. If end users / sales people / support people / the documentation believe that Unicode-only characters can be entered into the system, then the system needs to allow for that, even if nobody has made use of that functionality yet. If this is the case and you feel it should be changed because you don't think there is a need to enter such characters, then you need to bring that up with a manager / Product Owner / etc.

Also, just because the intended user base is in the US doesn't mean that Unicode-only characters won't be used. At the very least in "name" columns, "description" columns, etc., there could easily be a variety of accented characters, or who knows what. The discussion you should have with the Product Owner and/or team should revolve around what is required for each of these columns / data points.
Similar to the points made above regarding which characters should be allowed in these columns, you also need to determine how many characters the system allows people to send in. If a description field is capped at 500 - 1000 characters in the UI, and other parts of the code, and even the documentation (one can always hope, right?) agree, then yes, `NVARCHAR(MAX)` is absolutely overkill. BUT, if the column needs to store more than 4000 characters of a variety that is not represented by an 8-bit code page, then `NVARCHAR(MAX)` is not overkill (though you might need to consider a better design than storing too many of them in a single table). Either way, the fact that there are only 30 characters max in a particular `NVARCHAR(MAX)` column is good information to provide to a Product Owner / team so that a more informed decision can be made regarding the fate of that column. Perhaps a field that currently allows 1000 characters but has a max record length of 30 - 50 should be updated to allow only 75 - 100. But that needs to be a coordinated effort.

All of that said, if you want to determine whether there are any characters in an
`NVARCHAR` / `NCHAR` / `NTEXT` column that cannot be converted to `VARCHAR`, you need to convert to `VARCHAR` using the `_BIN2` variation of the collation being used for that particular column. For example, if a particular column is using `Albanian_100_CI_AS`, then you would specify `Albanian_100_BIN2` for the test. The reason for using a `_BIN2` collation is that non-binary collations will only find instances where there is at least one character that has no mapping at all in the code page and is thus converted into `?`. But non-binary collations do not catch instances where there are characters that don't have a direct mapping into the code page, but instead have a "best fit" mapping. For example, the superscript two character, `²`, has a direct mapping in code page 1252, so definitely no problem there. On the other hand, it doesn't have a direct mapping in code page 1250 (used by the Albanian collations), but it does have a "best fit" mapping which converts it into a regular `2`. The problem with a non-binary collation is that `2` will equate to `²`, so the row won't register as one that can't convert to `VARCHAR` without data loss.

When running such a test, ideally you would convert back to `NVARCHAR` explicitly so the code is clear about what it's doing, though not doing so will still implicitly convert back to `NVARCHAR`, so the behavior is the same either way.
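A minimal sketch of the check described above (the table, primary-key, and column names are placeholders — substitute your own, and swap `Albanian_100_BIN2` for the `_BIN2` variant of your column's actual collation):

```sql
-- Demonstrate the "best fit" problem with the superscript two character.
-- N'²' round-trips through code page 1250 as a regular '2', but a
-- non-binary collation treats N'2' and N'²' as equal, hiding the change:
DECLARE @Test TABLE (Col1 NVARCHAR(MAX));
INSERT INTO @Test (Col1) VALUES (N'²');

SELECT Col1 AS [LossyRows_NonBinary]
FROM   @Test
WHERE  Col1 <> CONVERT(NVARCHAR(MAX),
               CONVERT(VARCHAR(MAX), Col1 COLLATE Albanian_100_CI_AS));
-- 0 rows: the lossy conversion goes undetected

SELECT Col1 AS [LossyRows_Binary]
FROM   @Test
WHERE  Col1 <> CONVERT(NVARCHAR(MAX),
               CONVERT(VARCHAR(MAX), Col1 COLLATE Albanian_100_BIN2));
-- 1 row: the _BIN2 comparison flags the row

-- The same pattern against a real table (hypothetical names):
SELECT t.[PKColumn], t.[SomeColumn]
FROM   dbo.[SomeTable] t
WHERE  t.[SomeColumn] <> CONVERT(NVARCHAR(MAX),
               CONVERT(VARCHAR(MAX), t.[SomeColumn] COLLATE Albanian_100_BIN2));
```

Any rows returned by the `_BIN2` version contain at least one character that would not survive the conversion to `VARCHAR` unchanged.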