SQL Server – Detect if any values in NVARCHAR columns are actually Unicode

datatypes, sql-server, unicode

I have inherited some SQL Server databases. There is one table (I'll call it "G") with about 86.7 million rows and 41 columns in a source database (I'll call it "Q") on SQL Server 2014 Standard, which gets ETL'd over to a target database (I'll call it "P") with the same table name on SQL Server 2008 R2 Standard.

i.e.
[Q].[G] —> [P].[G]

EDIT: 3/20/2017:
Some people have asked if the source table is the ONLY source to the target table. Yes, it is the only source. As far as the ETL goes, there isn't any real transformation happening; it effectively is intended to be a 1:1 copy of the source data. Therefore, there are no plans to add additional sources to this target table.

A little over half of the columns in [Q].[G] are VARCHAR (source table):

  • 13 of the columns are VARCHAR(80)
  • 9 of the columns are VARCHAR(30)
  • 2 of the columns are VARCHAR(8).

Similarly, the same columns in [P].[G] (the target table) are NVARCHAR, with the same number of columns and the same widths. (In other words, the same lengths, but NVARCHAR.)

  • 13 of the columns are NVARCHAR(80)
  • 9 of the columns are NVARCHAR(30)
  • 2 of the columns are NVARCHAR(8).

This is not my design.

I'd like to ALTER the [P].[G] (target) columns' data types from NVARCHAR to VARCHAR.
I want to do it safely (without data loss from conversion).

How can I look at the data values in each NVARCHAR column in the target table to confirm whether or not the column actually contains any Unicode data?

A query (DMVs?) that can check each value (in a loop?) of each NVARCHAR column and tell me if ANY of the values is genuine Unicode would be the ideal solution, but other methods are welcome.

Best Answer

Suppose one of your columns does not contain any Unicode data. To verify that, you would need to read the column value for every row. Unless you have an index on the column, with a rowstore table you will need to read every data page from the table. With that in mind, I think it makes a lot of sense to combine all of the column checks into a single query against the table. That way you won't be reading the table's data many times and you don't have to code a cursor or some other kind of loop.

To check a single column, I believe that you can just do this:

SELECT COLUMN_1
FROM [P].[G]
WHERE CAST(COLUMN_1 AS VARCHAR(80)) <> CAST(COLUMN_1 AS NVARCHAR(80));
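
As a quick illustration of the behavior this check relies on (a standalone sketch; the sample strings are illustrative, and the result assumes a Latin database collation that cannot represent the characters):

SELECT CAST(N'abc' AS VARCHAR(80)) AS ascii_value,     -- 'abc': survives the cast unchanged
       CAST(N'日本語' AS VARCHAR(80)) AS unicode_value;  -- '???': lost in the cast, so it no longer equals the original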

A cast from NVARCHAR to VARCHAR should give you the same result unless there are Unicode characters, which will be converted to ?. The above code also handles NULLs correctly: NULL <> NULL evaluates to unknown, so those rows are not returned, which is fine because a NULL converts safely anyway. You have 24 columns to check, so you can check all of them in a single query by using scalar aggregates. One implementation is below:

SELECT 
  MAX(CASE WHEN CAST(COLUMN_1 AS VARCHAR(80)) <> CAST(COLUMN_1 AS NVARCHAR(80)) THEN 1 ELSE 0 END) COLUMN_1_RESULT
...
, MAX(CASE WHEN CAST(COLUMN_14 AS VARCHAR(30)) <> CAST(COLUMN_14 AS NVARCHAR(30)) THEN 1 ELSE 0 END) COLUMN_14_RESULT
...
, MAX(CASE WHEN CAST(COLUMN_23 AS VARCHAR(8)) <> CAST(COLUMN_23 AS NVARCHAR(8)) THEN 1 ELSE 0 END) COLUMN_23_RESULT
FROM [P].[G];

For each column you will get a result of 1 if any of its values contains Unicode. A result of 0 means that all of the data can be safely converted.
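
If you would rather not hand-write all 24 CASE expressions, the SELECT list can be generated from the catalog views mentioned in the question. A rough sketch, assuming the target table is dbo.G and that you run it in database P (adjust the schema and table name to match yours):

-- Build a query equivalent to the scalar aggregates above from sys.columns.
-- Assumes fixed-length NVARCHAR columns only (no NVARCHAR(MAX)), per the table described above.
DECLARE @sql NVARCHAR(MAX);

SELECT @sql = N'SELECT ' + STUFF((
    SELECT N', MAX(CASE WHEN CAST(' + QUOTENAME(c.name) + N' AS VARCHAR('
         + CAST(c.max_length / 2 AS NVARCHAR(10)) + N')) <> ' + QUOTENAME(c.name)
         + N' THEN 1 ELSE 0 END) AS ' + QUOTENAME(c.name + N'_RESULT')
    FROM sys.columns AS c
    JOIN sys.types AS t ON t.user_type_id = c.user_type_id
    WHERE c.object_id = OBJECT_ID(N'dbo.G')
      AND t.name = N'nvarchar'
    ORDER BY c.column_id
    FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 2, N'')
    + N' FROM dbo.G;';

EXEC sys.sp_executesql @sql;

The TYPE/.value() step keeps the <> operator from being XML-escaped when the column expressions are concatenated.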

I strongly recommend making a copy of the table with the new column definitions and copying your data there. You'll be doing expensive conversions if you do it in place, so making a copy might not be all that much slower. Having a copy means that you can easily validate that all of the data is still there (one way is to use the EXCEPT keyword), and you can undo the operation very easily.
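
For the validation step, a minimal sketch of the EXCEPT comparison (dbo.G_copy is a hypothetical name for the copy with the VARCHAR definitions):

-- Both queries should return zero rows if no values were altered by the conversion.
SELECT * FROM dbo.G
EXCEPT
SELECT * FROM dbo.G_copy;

SELECT * FROM dbo.G_copy
EXCEPT
SELECT * FROM dbo.G;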

Also, be aware that even if you don't have any Unicode data currently, it's possible that a future ETL could load Unicode into a previously clean column. If there is no check for this in your ETL process, you should consider adding one before doing this conversion.
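
One way to add such a check (a sketch only; the staging table name and column names are illustrative) is to test the incoming rows before the load and abort if anything would be lost:

-- Abort the load if any incoming NVARCHAR value would lose characters
-- when stored in the corresponding VARCHAR(80)/(30)/(8) column.
-- RAISERROR rather than THROW so it also works on the SQL Server 2008 R2 side.
IF EXISTS (
    SELECT 1
    FROM staging.G AS s
    WHERE CAST(s.COLUMN_1 AS VARCHAR(80)) <> s.COLUMN_1
       OR CAST(s.COLUMN_14 AS VARCHAR(30)) <> s.COLUMN_14
       -- ... repeat for the remaining NVARCHAR columns
)
BEGIN
    RAISERROR(N'Unicode data detected in a column that is about to become VARCHAR.', 16, 1);
END;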