I am not even sure if I am framing this question right, but I will try – I have a bunch of huge text files generated from an Oracle export on a Linux system. Each file is about 30 GB in size, and I have about 50 of them.
The goal is to load this data into Azure SQL Data Warehouse. Given the size of the data, BCP is not the right approach, so I had to use Polybase.
After converting the files from ASCII to UTF-8 encoding, I ran into an issue while querying the external tables: Polybase doesn't play well with these files, because each row is wrapped at a fixed width and so contains embedded line breaks.
The text file looks something like this:
101,102,103,104,105,106,107
108,108,109,110,111,112,113
114,115,116,117,118,119,120
121,122,123
--nothing here, just a blank line
201,202,203,204,205,206,207
208,209,210,211,212,213,214
215,216,217
Polybase tries to process 101 through 107 as a complete row and errors out, complaining there are not enough columns in the file.
Here is what I think is happening: because of the fixed-width wrapping, every line break is being treated as a row delimiter.
How do I convert this file so that it looks like the following:
101,102,103,104,105,106,107,108,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123{CR}{LF}
201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217{CR}{LF}
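To make the transformation concrete, here is a minimal Python sketch of what I am after. It assumes a blank line marks the end of a record, that the wrapped lines never end in a trailing comma, and that the input is already UTF-8; the file names are placeholders.

# merge_records.py -- rough sketch, not production code.
# Assumptions: a blank line ends a record, wrapped lines carry no trailing
# comma, input is UTF-8, and the file names below are placeholders.
import io

def merge_records(in_path, out_path):
    with io.open(in_path, "r", encoding="utf-8") as src, \
         io.open(out_path, "w", encoding="utf-8", newline="") as dst:
        chunks = []
        for line in src:
            line = line.rstrip("\r\n")
            if line.strip() == "":
                # blank line: the current record is complete
                if chunks:
                    dst.write(",".join(chunks) + "\r\n")  # explicit {CR}{LF}
                    chunks = []
            else:
                chunks.append(line)
        if chunks:  # last record may not be followed by a blank line
            dst.write(",".join(chunks) + "\r\n")

if __name__ == "__main__":
    merge_records("oracle_export_01.txt", "oracle_export_01_fixed.txt")

Since it reads line by line and writes as it goes, it should handle the 30 GB files without loading them into memory.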
EDIT: Here is the sample data from one of the files, opened in Git Bash on a Windows VM.
These files are supposed to have 167 columns, with , as the column separator. The problem is that, since each row spans multiple lines, it's difficult to process them through Polybase external tables.
Best Answer
Polybase is powerful but not particularly sophisticated, so it will not be able to deal with this format directly. As I see it, you have three options:
Please try to answer my questions above and provide a sample file, and I will try to help.
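In the meantime, once the files have been merged back into one row per record, a quick sanity check along these lines will tell you whether every row really has the 167 columns you mention before you point Polybase at the output. This is a sketch: it assumes fields never contain embedded commas, and the file name is a placeholder.

# check_columns.py -- sanity check on the converted output.
# Assumptions: no field contains an embedded comma; file name is a placeholder.
EXPECTED_COLUMNS = 167

def check(path):
    bad = 0
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            fields = line.rstrip("\r\n").count(",") + 1
            if fields != EXPECTED_COLUMNS:
                bad += 1
                print("line %d has %d fields" % (lineno, fields))
    print("rows with the wrong column count:", bad)

if __name__ == "__main__":
    check("oracle_export_01_fixed.txt")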