This sounds vaguely familiar. Oh yea, I wrote about this back on Aug 14, 2012 : error 1118: row size too large. utf8 innodb. Here is additional information from a Percona Expert on this subject : InnoDB create table error: “Row size too large”
I can see this putting a strain on an InnoDB table in terms of row length limits. Turning TEXT fields of ominous lengths into VARCHARs or even CHARs in InnoDB will have no effect (at least, no positive effects).
Your concern is quite valid and merits some respect. Why?
Some would do one of the following:
- Perform the conversion and deal with (mostly complain about) worse consequences
- Perform the conversion in a Development server, see the effects, and judge accordingly
- Ask why, do research, carry out due diligence in a Development server
It sounds like you did Option #3 from the above list. Your conclusion, given the lengths of data, is correct for your given dataset.
Is it safe to assume columns of NVARCHAR, NTEXT, NCHAR, BIT, INT, DECIMAL, FLOAT, and DATETIME all MUST be UNICODE...
Only the XML
and N
-prefixed types (NCHAR
, NVARCHAR
, and NTEXT
[which has been deprecated since SQL Server 2005 was released so please do not use it]) are Unicode. Those other types you mentioned are not strings and are not stored as strings, hence they are not relevant to this question.
... and therefore WILL NOT have any characters unable to be converted from UTF-16LE to UTF-8...
This is not exactly a valid question. Unicode characters are Unicode characters regardless of their encoding, whether it is UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE. Now, it is possible for the data itself to contain invalid sequences, such as invalid Surrogate Pairs. But then those aren't valid characters in the UTF-8 or UTF-32 encodings either.
... simply by exporting the values to a .txt file and resaving them with the UTF-8 Encoding prior to importing them to MySQL?
Well, you need to be sure to save the initial export file with a Unicode-encoding. So you would use either the -N
or -w
options with BCP.
Also, make sure that you are doing more than just changing the Byte Order Mark (BOM) of the file and are actually converting the Unicode / UT-16LE characters to UTF-8.
Best Answer
The '?' character replaced the original value because there was no equivalent character in the collation code page and is physically stored in the column instead of the original value.
Sorry to say but the original value is lost.