Text Processing – Handling Files with BOM (FF FE)

Tags: character-encoding, text-processing, unicode

I received a .csv file with the FF FE BOM:

$ head -n1 dotan.csv | hd
00000000  ff fe 41 00 64 00 20 00  67 00 72 00 6f 00 75 00  |..A.d. .g.r.o.u.|

When I use awk to parse it, I get a bunch of null bytes, which I suspect is due to the byte order. How can I swap the byte order on this file (using the CLI) so that normal tools will work with it?

Note that I think this file contains only ASCII characters (apart from the BOM), but I cannot confirm that, because grep treats it as a binary file:

$ grep -P '^[\x00-\x7f]' dotan.csv 
Binary file dotan.csv matches
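grep falls back to binary mode because of the NUL bytes, but the file utility can identify the encoding without tripping over them (the exact wording of its output varies between file versions):

```shell
# file reads the BOM and reports the encoding,
# typically something like "Little-endian UTF-16 Unicode text"
file dotan.csv
```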

Searching for the same pattern in Vim shows every character matching!

Using iconv to convert to ASCII does not get rid of the \x00 values; in fact it makes things worse, since the nulls are now embedded in what is otherwise plain ASCII:

$ iconv -f UTF-8 -t ASCII dotan.csv > fixed.txt 
iconv: illegal input sequence at position 0

$ iconv -f UTF-8 -t ASCII//IGNORE dotan.csv > fixed.txt

$ head -n1 fixed.txt | hd
00000000  41 00 64 00 20 00 67 00  72 00 6f 00 75 00 70 00  |A.d. .g.r.o.u.p.|
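The //IGNORE suffix only drops bytes that are invalid UTF-8, i.e. the FF FE pair; the 00 bytes are valid NUL characters in both UTF-8 and ASCII, so they pass through untouched. A quick check on a two-byte sample ("A" plus its NUL high byte) shows this:

```shell
# FF FE is dropped as invalid UTF-8; 41 00 ("A" followed by NUL) survives
printf '\xff\xfe\x41\x00' | iconv -f UTF-8 -t ASCII//IGNORE | hd
```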


Best Answer

According to this Wikipedia article, the FF FE BOM means UTF-16LE, so you should tell iconv to convert from UTF-16LE to UTF-8:

iconv -f UTF-16LE -t UTF-8 dotan.csv > fixed.txt
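One caveat: with the endianness spelled out as UTF-16LE, iconv treats the leading FF FE as an actual U+FEFF character and re-encodes it, so the output starts with the UTF-8 BOM (EF BB BF). A sketch that also strips that, assuming GNU sed for the \xNN escapes:

```shell
# convert, then drop a leading UTF-8 BOM if present (GNU sed)
iconv -f UTF-16LE -t UTF-8 dotan.csv | sed '1s/^\xef\xbb\xbf//' > fixed.txt
```

Alternatively, iconv -f UTF-16 (without the explicit endianness) uses the BOM to detect the byte order and consumes it, at least with glibc's iconv.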