SQL Server – Detect if any values in NVARCHAR columns are actually Unicode

datatypes, sql-server, unicode

I have inherited some SQL Server databases. There is one table (I'll call it "G") with about 86.7 million rows and 41 columns in a source database (I'll call it "Q") on SQL Server 2014 Standard, which gets ETL'd over to a target database (I'll call it "P") with the same table name on SQL Server 2008 R2 Standard.

i.e.
[Q].[G] —> [P].[G]

EDIT: 3/20/2017:
Some people have asked if the source table is the ONLY source to the target table. Yes, it is the only source. As far as the ETL goes, there isn't any real transformation happening; it effectively is intended to be a 1:1 copy of the source data. Therefore, there are no plans to add additional sources to this target table.

A little over half of the columns in [Q].[G] are VARCHAR (source table):

  • 13 of the columns are VARCHAR(80)
  • 9 of the columns are VARCHAR(30)
  • 2 of the columns are VARCHAR(8).

Similarly, the same columns in [P].[G] (the target table) are NVARCHAR, with the same number of columns and the same widths. (In other words, the same lengths, but NVARCHAR.)

  • 13 of the columns are NVARCHAR(80)
  • 9 of the columns are NVARCHAR(30)
  • 2 of the columns are NVARCHAR(8).

This is not my design.

I'd like to ALTER the [P].[G] (target) columns' data types from NVARCHAR to VARCHAR.
I want to do it safely (without data loss from conversion).

How can I look at the data values in each NVARCHAR column in the target table to confirm whether or not the column actually contains any Unicode data?

A query (DMVs?) that can check each value (in a loop?) of each NVARCHAR column and tell me if ANY of the values is genuine Unicode would be the ideal solution, but other methods are welcome.

Best Answer

Suppose one of your columns does not contain any Unicode data. To verify that, you would need to read the column value for every row. Unless you have an index on the column, with a rowstore table you will need to read every data page from the table. With that in mind, I think it makes a lot of sense to combine all of the column checks into a single query against the table. That way you won't be reading the table's data many times and you don't have to code a cursor or some other kind of loop.

To check a single column, I believe that you can just do this:

SELECT COLUMN_1
FROM [P].[G]
WHERE CAST(COLUMN_1 AS VARCHAR(80)) <> CAST(COLUMN_1 AS NVARCHAR(80));
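
As a quick illustration of the behavior this check relies on (a standalone sketch; the sample strings are illustrative, and the result assumes a Latin database collation that cannot represent the characters):

SELECT CAST(N'abc' AS VARCHAR(80)) AS ascii_value,     -- 'abc': survives the cast unchanged
       CAST(N'日本語' AS VARCHAR(80)) AS unicode_value;  -- '???': lost in the cast, so it no longer equals the original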

A cast from NVARCHAR to VARCHAR should give you the same result unless there are Unicode characters, which will be converted to ?. The above code also handles NULLs correctly: NULL <> NULL evaluates to unknown, so those rows are not returned, which is fine because a NULL converts safely anyway. You have 24 columns to check, so you can check all of them in a single query by using scalar aggregates. One implementation is below:

SELECT 
  MAX(CASE WHEN CAST(COLUMN_1 AS VARCHAR(80)) <> CAST(COLUMN_1 AS NVARCHAR(80)) THEN 1 ELSE 0 END) COLUMN_1_RESULT
...
, MAX(CASE WHEN CAST(COLUMN_14 AS VARCHAR(30)) <> CAST(COLUMN_14 AS NVARCHAR(30)) THEN 1 ELSE 0 END) COLUMN_14_RESULT
...
, MAX(CASE WHEN CAST(COLUMN_23 AS VARCHAR(8)) <> CAST(COLUMN_23 AS NVARCHAR(8)) THEN 1 ELSE 0 END) COLUMN_23_RESULT
FROM [P].[G];

For each column you will get a result of 1 if any of its values contains Unicode. A result of 0 means that all of the data can be safely converted.
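
If you would rather not hand-write all 24 CASE expressions, the SELECT list can be generated from the catalog views mentioned in the question. A rough sketch, assuming the target table is dbo.G and that you run it in database P (adjust the schema and table name to match yours):

-- Build a query equivalent to the scalar aggregates above from sys.columns.
-- Assumes fixed-length NVARCHAR columns only (no NVARCHAR(MAX)), per the table described above.
DECLARE @sql NVARCHAR(MAX);

SELECT @sql = N'SELECT ' + STUFF((
    SELECT N', MAX(CASE WHEN CAST(' + QUOTENAME(c.name) + N' AS VARCHAR('
         + CAST(c.max_length / 2 AS NVARCHAR(10)) + N')) <> ' + QUOTENAME(c.name)
         + N' THEN 1 ELSE 0 END) AS ' + QUOTENAME(c.name + N'_RESULT')
    FROM sys.columns AS c
    JOIN sys.types AS t ON t.user_type_id = c.user_type_id
    WHERE c.object_id = OBJECT_ID(N'dbo.G')
      AND t.name = N'nvarchar'
    ORDER BY c.column_id
    FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 2, N'')
    + N' FROM dbo.G;';

EXEC sys.sp_executesql @sql;

The TYPE/.value() step keeps the <> operator from being XML-escaped when the column expressions are concatenated.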

I strongly recommend making a copy of the table with the new column definitions and copying your data there. You'll be doing expensive conversions if you do it in place, so making a copy might not be all that much slower. Having a copy means that you can easily validate that all of the data is still there (one way is to use the EXCEPT keyword), and you can undo the operation very easily.
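
For the validation step, a minimal sketch of the EXCEPT comparison (dbo.G_copy is a hypothetical name for the copy with the VARCHAR definitions):

-- Both queries should return zero rows if no values were altered by the conversion.
SELECT * FROM dbo.G
EXCEPT
SELECT * FROM dbo.G_copy;

SELECT * FROM dbo.G_copy
EXCEPT
SELECT * FROM dbo.G;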

Also, be aware that even if you don't have any Unicode data currently, it's possible that a future ETL could load Unicode into a previously clean column. If there is no check for this in your ETL process, you should consider adding one before doing this conversion.
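
One way to add such a check (a sketch only; the staging table name and column names are illustrative) is to test the incoming rows before the load and abort if anything would be lost:

-- Abort the load if any incoming NVARCHAR value would lose characters
-- when stored in the corresponding VARCHAR(80)/(30)/(8) column.
-- RAISERROR rather than THROW so it also works on the SQL Server 2008 R2 side.
IF EXISTS (
    SELECT 1
    FROM staging.G AS s
    WHERE CAST(s.COLUMN_1 AS VARCHAR(80)) <> s.COLUMN_1
       OR CAST(s.COLUMN_14 AS VARCHAR(30)) <> s.COLUMN_14
       -- ... repeat for the remaining NVARCHAR columns
)
BEGIN
    RAISERROR(N'Unicode data detected in a column that is about to become VARCHAR.', 16, 1);
END;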