SQL Server Export – Export to Format Compatible with SQL Server

Tags: csv, import, linux, mac os x, sql server

I need to provide data in a format that could be imported into a Microsoft SQL Server database; however, I do not have Windows or access to a SQL Server instance.

The data is currently in several CSV spreadsheets. I looked at BULK INSERT but found its CSV support quite limited, such as not recognizing double quotes (documentation: Specifying Field and Row Terminators). I have read that edge cases, such as a field that happens to include a quote (and so needs to be escaped), are not supported.

Is there a more reliable intermediary format that could be generated from Linux/OSX? For example MySQL dump, Excel, Sqlite, XML, etc?

I also need this import process to create the tables and not require coding. The person importing is non-technical. It would be a one-off process.

  • The target system is SQL Server 2008, though I am ideally interested in a solution not tied to a specific version.
  • The data to import is ~100MB in a single CSV file.
  • There are embedded field/row delimiters, which are handled fine when the fields are quoted (as the Python csv module does).
  • There is no need to handle foreign keys, indexes, constraints, etc.
  • The encoding is UTF-8.
  • The source is scraped data from German web pages currently stored in a CSV file. I could easily import it from there into MySQL / Postgres if that was a reliable vector.

I won't be able to test the import because I don't have direct access to the SQL Server database, so I am looking for a more reliable approach, for example a MySQL dump, XLS, XML, etc.

Best Answer

There are some issues with this request:

  1. What version and edition of SQL Server is the target system?

  2. How much data is being imported? 10k, 10 MB, more?

  3. How many CSV files are there?

  4. You have stated that handling of double-quotes is required, implying text-qualified fields and embedded text qualifiers. Will there also be embedded field delimiters (i.e. commas)? Those usually aren't a problem, but will there also be embedded row delimiters (i.e. \r\n, or just \n)? These are the biggest problem for most CSV parsers.

  5. In a comment on Scott's answer you raise a concern about: "Double quotes is one example and I expect there are other shortcomings if such a basic feature is missing - unicode, size limits, etc.". Can you please update the question to include these and any other missing requirements?

  6. In a comment on the question, you ask, regarding SSMS: "will this create the required tables or they must be predefined?". Is creating the tables a requirement of this import process? If so, or even if creating the tables is an optional benefit, can it please be stated clearly in the question?

  7. If the tables do not already exist, what needs to happen with regards to Foreign Keys, Indexes, Default Constraints, Check Constraints, etc?

  8. What collation is being used, or at least is desired, for the SQL Server tables? Is the plan to simply inherit the current default collation of the database that you are importing into? And to be clear about the term "collation", since it can have slightly different meanings depending on the system you are using, I am speaking of: Locale / LCID / Culture / Code Page (if applicable). It would also help to know whether the source is sensitive or insensitive to case, accents, etc., or whether a binary collation is being used.

  9. Is there a more reliable intermediary format that could be generated from Linux/OSX? For example MySQL dump, Excel, Sqlite, XML, etc?

    The only truly reliable format will be something generated by SQL Server. XML is generally very reliable for transporting data (this is what XML was meant to do) and doesn't have the parsing issues that are inherent in delimited files (i.e. embedded text qualifiers and delimiters). BUT, you still need to create the tables, and you need to write and test the code that parses the XML into a result set so that it can be inserted into those tables (a sketch of that shredding step follows this list). HOWEVER (continued in the next item)...

  10. You stated in that comment on Scott's answer:

    Problem is I won't be able to test the import because I don't have direct access to the MS SQL database, so am looking for a more reliable approach.

    "Reliability" can only be determined through testing. It doesn't matter what should work, anything can go wrong. For example, many folks are not aware that it is common for XML to have an encoding of UTF-8, yet SQL Server only handles UTF-16 (Little Endian) for XML or even NVARCHAR data. Seeing as how this data is coming from "Linux/OSX", I would expect the default encoding to be UTF-8.

    The encoding issue and other nuances should all (well, "mostly") reveal themselves in testing, but you have no way to test. Hence it will be difficult to get a very reliable answer regarding a reliable import mechanism.

  11. What is the source of the data? I assume it is either Linux or OSX as those were mentioned in the question. But is it coming from MySQL specifically (since a "MySQL dump" was also mentioned)? Knowing if the source is MySQL or PostgreSQL or flat files, etc will help determine what tools are both available and best suited to this request.

  12. Regardless of the source, however, keep in mind that the destination (i.e. Windows / SQL Server) is most "comfortable" dealing with UTF-16 (Little Endian) encoded data. So while UTF-8 might be the default output type for Linux/OSX-based systems, if there is an option for using UTF-16 Little Endian for the output encoding, that will help reduce potential issues once the script is moved over to the destination system.
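
Coming back to item 9 for a moment: even if XML were used as the transport format, someone on the SQL Server side would still need to write the shredding code. The following is only a sketch, with made-up element names, a made-up file path, and a target table assumed to already exist; the nodes() / value() methods it uses are the built-in way to turn an XML document into a result set.

-- Load the file as a single binary blob so the XML parser can honor the file's own
-- encoding declaration (which sidesteps part of the UTF-8 vs UTF-16 issue from item 10).
DECLARE @data XML;
SELECT @data = CONVERT(XML, BulkColumn)
FROM   OPENROWSET(BULK 'C:\Import\data.xml', SINGLE_BLOB) AS src;  -- hypothetical path

-- Shred the (hypothetical) <rows><row>...</row></rows> structure into a table.
INSERT INTO [SomeTable] ([Name], [City])
SELECT  r.value('(Name/text())[1]', 'NVARCHAR(200)'),
        r.value('(City/text())[1]', 'NVARCHAR(200)')
FROM    @data.nodes('/rows/row') AS t(r);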

One last thing to consider: You mention "MySQL dump" as a potential "reliable format", and those dumps include the DDL (i.e. CREATE TABLE) and DML (i.e. INSERT) statements. If you have the ability to write a script that contains the DDL statements (if necessary) and DML statements, then you don't have to worry about any formatting issues. Just convert the CSV data into INSERT statements and be done. The only issue you would be left with is not being able to test the scripts. But still, outside of which type of quote or brackets to use for object names and string literals, it's really hard to go wrong with:

INSERT INTO table (column1_name, column2_name, ...) VALUES (column1_val, column2_val, ...);

And to make things even easier (a full example follows this list):

  1. Enclose table and column names in square brackets: [ and ] (e.g. [Customers]), or double-quotes: " (e.g. "Customers")
  2. Enclose string, date, and GUID literals in regular single-quotes / apostrophes: ' (e.g. 'some text')
  3. If any string literals are for Unicode / NVARCHAR / XML columns, prefix those literals with a capital-"N": N (e.g. N'some unicode text')
  4. Make sure that any date / datetime values are formatted in a compatible way (there are several options and some depend on the current language setting in SQL Server, but generally YYYY-MM-DD HH:mm:ss.mmm should be fine)
  5. Max size/length for CHAR / VARCHAR and BINARY / VARBINARY columns is 8000. Anything larger will need to use VARCHAR(MAX) and VARBINARY(MAX), respectively.
  6. Max size/length for NCHAR / NVARCHAR columns (i.e. Unicode) is 4000. Anything larger will need to use NVARCHAR(MAX).

Since MySQL was mentioned in the question, even if indirectly, I figured it wouldn't hurt to test using mysqldump to see how close it can actually get. I used the following options:

--flush-logs
--compatible=mssql
--no-create-db
--no-tablespaces
--order-by-primary
--default-character-set=utf8
--skip-set-charset
--hex-blob
--result-file
--skip-add-locks
--skip-triggers
--skip-comments
--skip-disable-keys
--compact
--skip-extended-insert
--skip-add-drop-table
--quote-names

Some notes:

  • I used utf8 as the charset since utf16le, ucs2, and a few others are not allowed in this particular context :-(
  • The bottom two options -- --skip-add-drop-table and --quote-names -- might not need to be explicitly stated as they might be implied by --compatible=mssql
  • At minimum you will need to do some string replacements on the output file to adjust the syntax for SQL Server. In the list below, pay close attention to the space before, and sometimes after, the "before" and "after" strings, as they are necessary to prevent matching parts of valid strings (a before / after illustration follows this list).
    1. " int(11) " -> " INT "
    2. " timestamp " -> " DATETIME "
    3. " text " -> " VARCHAR(MAX) "
    4. " blob " -> " VARBINARY(MAX) "
    5. " DEFAULT NULL" -> " NULL" (no space at the end of either the "before" or "after" strings)
  • After making the 5 adjustments noted directly above, you are still left with a few potential issues:
    1. Did any of the string replacements above accidentally match text within any string columns?
    2. If there are any varchar columns with a length of 8001 or more, those lengths will need to be translated to the string MAX so that the resulting data type will be VARCHAR(MAX) instead of VARCHAR(20000).
    3. If you have any Unicode data, especially any characters that do not fit into the Code Page of the default collation of the destination database, it will get converted into question marks ( ? ) unless you can find a way to prefix those string literals with a capital-N. Along these same lines, the CREATE TABLE statements will need to be adjusted for any Unicode columns such that they are NCHAR / NVARCHAR instead of CHAR / VARCHAR. Non-Unicode columns can remain as CHAR / VARCHAR.
    4. Potentially other nuances / differences that I have missed.
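
For illustration, here is roughly what one (hypothetical) table from such a dump could look like before and after those five replacements. The "before" half is only an approximation of mysqldump --compatible=mssql output (the exact output varies by MySQL version), and the double-quoted identifiers are valid in SQL Server as long as QUOTED_IDENTIFIER is ON (the default):

-- Before (approximate dump output; table and columns are made up):
CREATE TABLE "pages" (
  "id" int(11) NOT NULL,
  "title" varchar(255) DEFAULT NULL,
  "body" text NOT NULL,
  "scraped_at" timestamp NOT NULL
);

-- After the five replacements (per issue 3 above, columns holding German text
-- would additionally need to become NCHAR / NVARCHAR / NVARCHAR(MAX)):
CREATE TABLE "pages" (
  "id" INT NOT NULL,
  "title" varchar(255) NULL,
  "body" VARCHAR(MAX) NOT NULL,
  "scraped_at" DATETIME NOT NULL
);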

P.S. For the sake of completeness as it relates to migrating data from MySQL to SQL Server, Microsoft does have a Migration Assistant, but from what I can tell so far it requires a connection from SQL Server to MySQL and will not work on a dump file.


UPDATE

New info (from comment on this answer):

The source is scraped data from German web pages currently stored in a CSV file. I could easily import it from there into MySQL / Postgres if that was a reliable vector.

So here's a thought. If you can get this data into one RDBMS, then why not another? Assuming you were going to set up a temporary Windows VM for testing anyway, you could just:

  1. Install SQL Server 2008 Express Edition (i.e. free) on the VM
  2. Create a database to hold this data (it will make life much easier if you can find out from your client what default collation their database is using and use the same one for this new database!)
  3. Create the intended schema (tables, etc) for this data
  4. Import this data directly into these tables
  5. Take a BACKUP of this database
  6. Create another database that will act as your client's database
  7. Write a script that will:
    1. RESTORE the backup file into a new database with a new name (not the original name)
    2. Copy the tables / data from the new database into the "customer" database
    3. DROP the new / data migration database
  8. TEST!
  9. Give the backup file and the script that makes use of it to your client.

While this does require that you write a T-SQL script (a rough sketch follows below), it does not require installing any additional software, and the data is already in columns of the proper datatypes and doesn't need to be parsed :-).
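
A rough sketch of what that script could look like is below. Every name and path in it (the logical file names, the database names, the table name) is made up and would need to match the actual backup; there is also no error handling, since this is only meant to show the shape of steps 7.1 through 7.3:

-- 7.1: Restore the backup under a temporary name (logical file names and paths are hypothetical)
RESTORE DATABASE [DataMigrationTemp]
FROM DISK = N'C:\Temp\ScrapedData.bak'
WITH MOVE N'ScrapedData'     TO N'C:\Temp\DataMigrationTemp.mdf',
     MOVE N'ScrapedData_log' TO N'C:\Temp\DataMigrationTemp_log.ldf';

-- 7.2: Copy the table(s) / data into the client's database.
-- SELECT ... INTO creates the target table but copies no indexes or constraints,
-- which matches the stated requirements.
SELECT *
INTO   [ClientDatabase].[dbo].[ScrapedPages]
FROM   [DataMigrationTemp].[dbo].[ScrapedPages];

-- 7.3: Drop the temporary migration database
DROP DATABASE [DataMigrationTemp];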

P.S. Again, for the sake of completeness I was also looking at SQL Server Data Tools (SSDT) as a possible solution as it can export schema and data into a single .dacpac file and import that into a destination. However, this does require installing SSDT. But, you wouldn't have to write a T-SQL script as SSDT handles all of the SQL for you.