I am not even sure if I am framing this question right, but I will try – I have a bunch of huge text files generated from an Oracle export on a Linux system. Each file is about 30 GB in size, and I have about 50 of them.
The goal is to load this data into Azure SQL Data Warehouse. Given the size of the data, BCP is not the right approach, so I had to use Polybase.
After converting the files from ASCII to UTF-8 encoding, I ran into an issue while querying the external tables: Polybase doesn't play well with these files, because each row is wrapped at a fixed width and so contains embedded line breaks.
The text file looks something like this:
101,102,103,104,105,106,107
108,108,109,110,111,112,113
114,115,116,117,118,119,120
121,122,123
--nothing here, just a blank line
201,202,203,204,205,206,207
208,209,210,211,212,213,214
215,216,217
Polybase tries to process 101 through 107 as a complete row and errors out, complaining there are not enough columns in the file.
Here is what I think is happening: because of the fixed-width wrapping, every line break is being treated as a row delimiter.
How do I convert this file so that it looks like the following:
101,102,103,104,105,106,107,108,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123{CR}{LF}
201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217{CR}{LF}
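To make the transformation concrete, here is a minimal Python sketch of what I am after. It assumes a blank line marks the end of a record, that the wrapped lines never end in a trailing comma, and that the input is already UTF-8; the file names are placeholders.

# merge_records.py -- rough sketch, not production code.
# Assumptions: a blank line ends a record, wrapped lines carry no trailing
# comma, input is UTF-8, and the file names below are placeholders.
import io

def merge_records(in_path, out_path):
    with io.open(in_path, "r", encoding="utf-8") as src, \
         io.open(out_path, "w", encoding="utf-8", newline="") as dst:
        chunks = []
        for line in src:
            line = line.rstrip("\r\n")
            if line.strip() == "":
                # blank line: the current record is complete
                if chunks:
                    dst.write(",".join(chunks) + "\r\n")  # explicit {CR}{LF}
                    chunks = []
            else:
                chunks.append(line)
        if chunks:  # last record may not be followed by a blank line
            dst.write(",".join(chunks) + "\r\n")

if __name__ == "__main__":
    merge_records("oracle_export_01.txt", "oracle_export_01_fixed.txt")

Since it reads line by line and writes as it goes, it should handle the 30 GB files without loading them into memory.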
EDIT: Here is the sample data from one of the files, opened in Git Bash on a Windows VM.
These files are supposed to have 167 columns, with , as the column separator. The problem is that, since each row spans multiple lines, it's difficult to process them through Polybase external tables.
Best Answer
Polybase is powerful but not particularly sophisticated, so it will not be able to deal with this format directly. As I see it, you have three options:
Please try to answer my questions above and provide a sample file, and I will try to help.
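In the meantime, once the files have been merged back into one row per record, a quick sanity check along these lines will tell you whether every row really has the 167 columns you mention before you point Polybase at the output. This is a sketch: it assumes fields never contain embedded commas, and the file name is a placeholder.

# check_columns.py -- sanity check on the converted output.
# Assumptions: no field contains an embedded comma; file name is a placeholder.
EXPECTED_COLUMNS = 167

def check(path):
    bad = 0
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            fields = line.rstrip("\r\n").count(",") + 1
            if fields != EXPECTED_COLUMNS:
                bad += 1
                print("line %d has %d fields" % (lineno, fields))
    print("rows with the wrong column count:", bad)

if __name__ == "__main__":
    check("oracle_export_01_fixed.txt")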