Is it safe to assume columns of NVARCHAR, NTEXT, NCHAR, BIT, INT, DECIMAL, FLOAT, and DATETIME all MUST be UNICODE...
Only the XML and N-prefixed types (NCHAR, NVARCHAR, and NTEXT [which has been deprecated since SQL Server 2005 was released, so please do not use it]) are Unicode. Those other types you mentioned are not strings and are not stored as strings, hence they are not relevant to this question.
... and therefore WILL NOT have any characters unable to be converted from UTF-16LE to UTF-8...
This is not exactly a valid question. Unicode characters are Unicode characters regardless of their encoding, whether it is UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE. Now, it is possible for the data itself to contain invalid sequences, such as invalid Surrogate Pairs. But then those aren't valid characters in the UTF-8 or UTF-32 encodings either.
... simply by exporting the values to a .txt file and resaving them with the UTF-8 Encoding prior to importing them to MySQL?
Well, you need to be sure to save the initial export file with a Unicode encoding. So you would use either the -N or -w options with BCP.
Also, make sure that you are doing more than just changing the Byte Order Mark (BOM) of the file and are actually converting the Unicode / UTF-16LE characters to UTF-8.
There are some issues with this request:
What version and edition of SQL Server is the target system?
How much data is being imported? 10k, 10 Mb, more?
How many CSV files are there?
You have stated that handling of double-quotes is required, implying text-qualified fields and embedded text-qualifiers. Will there also be embedded field delimiters (i.e. ,)? Those usually aren't a problem, but will there also be embedded row delimiters (i.e. \r\n or just \n)? These are the biggest problem for most CSV parsers.
In a comment on Scott's answer you raise a concern about: "Double quotes is one example and I expect there are other shortcomings if such a basic feature is missing - unicode, size limits, etc.". Can you please update the question to include these and any other missing requirements?
In a comment on the question, you ask, regarding SSMS: "will this create the required tables or they must be predefined?". Is creating the tables a requirement of this import process? If so, or even if creating the tables is an optional benefit, can it please be stated clearly in the question?
If the tables do not already exist, what needs to happen with regards to Foreign Keys, Indexes, Default Constraints, Check Constraints, etc?
What collation is being used, or at least is desired for the SQL Server tables? Is the plan to simply inherit the current default collation for the database that you are importing into? And to be clear about the term "collation" since it can have slightly different meaning depending on the system you are using, I am speaking of: Locale / LCID / Culture / Code Page (if applicable). It would also help to know if the source is sensitive or insensitive for case, accents, etc, or if a binary collation is being used.
Is there a more reliable intermediary format that could be generated from Linux/OSX? For example MySQL dump, Excel, Sqlite, XML, etc?
The only truly reliable format will be something generated by SQL Server. XML is generally very reliable for transporting the data (this is what XML was meant to do), and doesn't have the parsing issues that are inherent in delimited files (i.e. embedded text qualifiers and delimiters). BUT, you still need to create the tables, and you need to write and test the parsing of the XML into a result set so that it can be inserted into the tables. HOWEVER (continued in the next item)...
You stated in that comment on Scott's answer:
Problem is I won't be able to test the import because I don't have direct access to the MS SQL database, so am looking for a more reliable approach.
"Reliability" can only be determined through testing. It doesn't matter what should work, anything can go wrong. For example, many folks are not aware that it is common for XML to have an encoding of UTF-8, yet SQL Server only handles UTF-16 (Little Endian) for XML or even NVARCHAR data. Seeing as how this data is coming from "Linux/OSX", I would expect the default encoding to be UTF-8.
The encoding issue and other nuances should all (well, "mostly") reveal themselves in testing, but you have no way to test. Hence it will be difficult to get a very reliable answer regarding a reliable import mechanism.
What is the source of the data? I assume it is either Linux or OSX as those were mentioned in the question. But is it coming from MySQL specifically (since a "MySQL dump" was also mentioned)? Knowing if the source is MySQL or PostgreSQL or flat files, etc will help determine what tools are both available and best suited to this request.
Regardless of the source, however, keep in mind that the destination (i.e. Windows / SQL Server) is most "comfortable" dealing with UTF-16 (Little Endian) encoded data. So while UTF-8 might be the default output type for Linux/OSX-based systems, if there is an option for using UTF-16 Little Endian for the output encoding, that will help reduce potential issues once the script is moved over to the destination system.
One last thing to consider: You mention "MySQL dump" as a potential "reliable format", and those dumps include the DDL (i.e. CREATE TABLE) and DML (i.e. INSERT) statements. If you have the ability to write a script that contains the DDL statements (if necessary) and DML statements, then you don't have to worry about any formatting issues. Just convert the CSV data into INSERT statements and be done. The only issue you would be left with is not being able to test the scripts. But still, outside of which type of quote or brackets to use for object names and string literals, it's really hard to go wrong with:
INSERT INTO table (column1_name, column2_name, ...) VALUES (column1_val, column2_val, ...);
And to make things even easier:
- Enclose table and column names in square brackets: [ and ] (e.g. [Customers]), or double-quotes: " (e.g. "Customers")
- Enclose string, date, and GUID literals in regular single-quotes / apostrophes: ' (e.g. 'some text')
- If any string literals are for Unicode / NVARCHAR / XML columns, prefix those literals with a capital "N": N (e.g. N'some unicode text')
- Make sure that any date / datetime values are formatted in a compatible way (there are several options and some depend on the current language setting in SQL Server, but generally YYYY-MM-DD HH:mm:ss.mmm should be fine)
- Max size/length for CHAR / VARCHAR and BINARY / VARBINARY columns is 8000. Anything larger will need to use VARCHAR(MAX) and VARBINARY(MAX), respectively.
- Max size/length for NCHAR / NVARCHAR columns (i.e. Unicode) is 4000. Anything larger will need to use NVARCHAR(MAX).
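Putting those rules together, a generated script could look something like the following (the table and column names are purely illustrative):

CREATE TABLE [Customers]
(
  [CustomerID]  INT           NOT NULL,
  [CompanyName] NVARCHAR(100) NOT NULL, -- Unicode column
  [Notes]       VARCHAR(MAX)  NULL,     -- non-Unicode, possibly over 8000 characters
  [CreatedOn]   DATETIME      NOT NULL
);

INSERT INTO [Customers] ([CustomerID], [CompanyName], [Notes], [CreatedOn])
VALUES (1, N'Müller GmbH', 'plain text is fine here', '2015-06-01 14:30:00.000');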
Since MySQL was mentioned in the question, even if indirectly, I figured it wouldn't hurt to test using mysqldump to see how close it can actually get. I used the following options:
--flush-logs
--compatible=mssql
--no-create-db
--no-tablespaces
--order-by-primary
--default-character-set=utf8
--skip-set-charset
--hex-blob
--result-file
--skip-add-locks
--skip-triggers
--skip-comments
--skip-disable-keys
--compact
--skip-extended-insert
--skip-add-drop-table
--quote-names
Some notes:
- I used utf8 as the charset since utf16le, ucs2, and a few others are not allowed in this particular context :-(
- The bottom two options -- --skip-add-drop-table and --quote-names -- might not need to be explicitly stated as they might be implied by --compatible=mssql
- At minimum you will need to do some string replacements on the output file to adjust the syntax for SQL Server (a small before / after sketch follows this list). In the list below, pay close attention to the space before, and sometimes after, the "before" and "after" strings as they are necessary to prevent matching parts of valid strings.
- " int(11) " -> " INT "
- " timestamp " -> " DATETIME "
- " text " -> " VARCHAR(MAX) "
- " blob " -> " VARBINARY(MAX) "
- " DEFAULT NULL" -> " NULL" (no space at the end of either the "before" or "after" strings)
- After making the 5 adjustments noted directly above, you are still left with a few potential issues:
- Did any of the string replacements above accidentally match text within any string columns?
- If there are any varchar columns with a length of 8001 or more, those lengths will need to be translated to the string MAX so that the resulting data type will be VARCHAR(MAX) instead of VARCHAR(20000).
- If you have any Unicode data, especially any characters that do not fit into the Code Page of the default collation of the destination database, it will get converted into question marks (?) unless you can find a way to prefix those string literals with a capital-N. Along these same lines, the CREATE TABLE statements will need to be adjusted for any Unicode columns such that they are NCHAR / NVARCHAR instead of CHAR / VARCHAR. Non-Unicode columns can remain as CHAR / VARCHAR.
- potentially other nuances / differences that I have missed.
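To show the shape of those replacements, here is a hypothetical fragment of mysqldump output (the table and columns are made up for illustration) and what it should look like after the adjustments:

-- Before (as produced by mysqldump --compatible=mssql):
CREATE TABLE "Products" (
  "ProductID" int(11) NOT NULL,
  "Description" text,
  "Price" decimal(10,2) DEFAULT NULL
);

-- After the string replacements (change "Description" to NVARCHAR(MAX) instead
-- if that column needs to hold Unicode data):
CREATE TABLE "Products" (
  "ProductID" INT NOT NULL,
  "Description" VARCHAR(MAX),
  "Price" decimal(10,2) NULL
);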
P.S. For the sake of completeness as it relates to migrating data from MySQL to SQL Server, Microsoft does have a Migration Assistant, but from what I can tell so far it requires a connection from SQL Server to MySQL and will not work on a dump file.
UPDATE
New info (from comment on this answer):
The source is scraped data from German web pages currently stored in a CSV file. I could easily import it from there into MySQL / Postgres if that was a reliable vector.
So here's a thought. If you can get this data into one RDBMS, then why not another? Assuming you were going to set up a temporary Windows VM for testing anyway, you could just:
- Install SQL Server 2008 Express Edition (i.e. free) on the VM
- Create a database to hold this data (it will make life much easier if you can find out from your client what default collation their database is using and use the same one for this new database!)
- Create the intended schema (tables, etc) for this data
- Import this data directly into these tables
- Take a BACKUP of this database
- Create another database that will act as your client's database
- Write a script that will:
  - RESTORE the backup file into a new database with a new name (not the original name)
  - copy the tables / data from the new database into the "customer" database
  - DROP the new / data migration database
- TEST!
- Give the backup file and the script that makes use of it to your client.
While this does require that you write a T-SQL script, it does not require any additional software or install, and the data is already in columns of the proper datatypes and doesn't need to be parsed :-).
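Here is a rough sketch of what that script could look like; the database, file, table, and logical file names below are all placeholders and would need to match your actual backup and the client's environment:

-- 1) Restore the backup under a new, temporary name:
RESTORE DATABASE [DataMigrationTemp]
  FROM DISK = N'C:\Temp\DataMigration.bak'
  WITH MOVE N'DataMigration'     TO N'C:\SQLData\DataMigrationTemp.mdf',
       MOVE N'DataMigration_log' TO N'C:\SQLData\DataMigrationTemp.ldf';

-- 2) Copy the data into the client's database (one INSERT...SELECT per table):
INSERT INTO [ClientDB].[dbo].[Customers] ([CustomerID], [CompanyName])
  SELECT [CustomerID], [CompanyName]
  FROM   [DataMigrationTemp].[dbo].[Customers];

-- 3) Remove the temporary data-migration database:
DROP DATABASE [DataMigrationTemp];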
P.S. Again, for the sake of completeness I was also looking at SQL Server Data Tools (SSDT) as a possible solution as it can export schema and data into a single .dacpac file and import that into a destination. However, this does require installing SSDT. But, you wouldn't have to write a T-SQL script as SSDT handles all of the SQL for you.
Best Answer
Notes
There are several things to go over here:
Just to get this out of the way: gz[ip] and 7zip.exe have nothing to do with any of this. Compression does not alter encoding or any of the original bytes (else compression / decompression wouldn't be reliable). Well, theoretically there can be a bug in one of them such that the decompressed output is somehow slightly different, but that would be a widespread problem, I would think; these algorithms, and these two tools in particular, have been around for a long time and are known to be reliable.
Keep in mind: files, even text files, contain bytes, not characters. Those bytes can represent some set of characters, or another set, or something else. But in the end, it is just a collection of bytes. So if you see Ã, that is not because there is an Ã in the file; it is due to there being one or more bytes that are currently being interpreted as representing an Ã. It could be that those same bytes should truly be interpreted as representing something else. And, at the same time, it could also be that some other sequence of bytes, under a different interpretation, also represents Ã.

Ok, this right here should raise a red flag: why would "simply copying a file" produce an encoding error? There can only be an encoding error if the file is being read, else it's just a collection of bytes moving from point A to point B (or perhaps it's merely being relinked and not moving, but either way).
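To make the bytes-versus-characters point concrete in SQL Server terms (assuming a database whose default collation is based on code page 1252, such as Latin1_General_CI_AS), the same character maps to different byte sequences depending on the encoding:

SELECT CONVERT(VARBINARY(4), N'é') AS [UTF-16LE bytes], -- 0xE900
       CONVERT(VARBINARY(4), 'é')  AS [CP-1252 byte];   -- 0xE9
-- The UTF-8 encoding of é would be 0xC3A9, which is neither of the above.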
Here is another key indicator (that is likely being misinterpreted by some): the error is a UTF-8 encoding error. This tells us two things:

- The "Ã©" characters are UTF-8 encoded bytes already (i.e. 0xC383C2A9), and not Windows-1252 bytes (0xC3A9) that should instead be interpreted as UTF-8 to produce "é".
- "Ã©e" is not the error, because if it was the error, then it wouldn't be able to display it. An encoding error is when a byte sequence can't produce a character in that encoding. Meaning: there is something missing from the "Ã©e" sequence. The "Ã©e" sequence should be the remaining parts of what could be decoded from the bytes in the file. Keep in mind that Unicode is a specification, not software. The implementation is left up to each vendor. There are guidelines regarding compliant implementations, and those include how to handle decoding errors. But there is some flexibility as well. So some implementations might throw a hard error in this case, others might display the "replacement" character (�), and some might display nothing in that position. In either case, the "Ã©e" sequence, taken literally, is most likely a red herring and not what we should be looking at.

Ok. So, in Notepad++, when it shows "UTF-8" on the right side of the bottom bar upon opening a file, that is not necessarily a guarantee that the file is, in fact, encoded as UTF-8. That is a best guess, probably based on common byte sequences. If the encoding indicator instead showed "UTF-8-BOM", then that would be a guarantee that the file was encoded as UTF-8. The "-BOM" indicates that a Byte Order Mark is present. A BOM is an optional 2- to 4-byte sequence at the beginning of a file that should be hidden but indicates the encoding of the file. This is only available for Unicode. There are two 2-byte BOMs for UTF-16, one for Little Endian and one for Big Endian. There are also two 4-byte BOMs for UTF-32, one for each Endianness. UTF-8 only has a single 3-byte BOM since Endianness does not apply to UTF-8 (since the code units are a single byte, there is only one order that the byte can be in). Of course, having a BOM does not guarantee that it will be respected or even understood by whatever is reading the file. In the case presented here, having a BOM should not have any effect on the outcome since the file is already being read as a UTF-8 file. And again, if Notepad++ is showing "Ã©e" and shows "UTF-8" in the lower-right corner, then it's already UTF-8, and not Windows-1252 / ISO-8859-1 data that needs to be told to be read as UTF-8 in order to produce "é".
And: is "Ã©e" the entire value for that field? Is it within double-quotes in the file? "Ã©e" does not appear to be a URL ;-). Also, does the value shown in the error match, exactly, the value you see in the file? If so, there is likely a hidden character in the file that is causing this problem.
What do you mean by "ASCII" characters? One of the features of UTF-8 is that the encoding of the first 128 code points (U+0000 through U+007F) are identical to standard ASCII. This was the primary design goal of UTF-8: full ACSII compatibility (not 8-bit Extended ASCII, just 7-bit standard ASCII). So, if the majority of the document is using US English characters and punctuation, then yes, you should expect to see that just the same whether you open the file as ASCII or UTF-8.
This is not surprising. I don't think it's possible to get an ANSI / ASCII encoding error. There are only a few code points / values that are undefined in code page Windows-1252, and they generally behave as "hidden" characters.
Of course, if you do have valid UTF-8 encoded characters in that file, converting to ANSI will change them into "?" if they are not available in Windows-1252.
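The SQL Server equivalent of that behavior (again assuming a default collation based on code page 1252) is forcing Unicode data into a non-Unicode type:

SELECT CAST(N'é' AS VARCHAR(10)) AS [Kept],     -- é exists in code page 1252
       CAST(N'Ж' AS VARCHAR(10)) AS [Replaced]; -- Cyrillic Ж does not, so it becomes ?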
Take-aways
- If the "Ã©" indicated a byte sequence of 0xC3A9, then there would be no UTF-8 encoding error, since the byte sequence of 0xC3A9 is valid UTF-8 (for "é").
- "Ã©e" isn't the error, but instead a result of the error.

Bottom line: there is too much guessing / wild-goose-chasing without seeing the original bytes of the line that has the encoding error. Having that line will help determine both a) what happened, and b) how to proceed.
If the poster of the bounty could add a comment to this answer providing additional details, especially the exact byte sequence that is getting the error, that would help solve this.