Convert binary encoding that head and Notepad can read to UTF-8

character encodingcommand line

I have a CSV file which is in binary character set but I have to convert to UTF-8 to process in HDFS (Hadoop).

I have used the below command to check characterset.

file -bi filename.csv

Output :

application/octet-stream; charset=binary

when I try to convert the file from binary to UTF-8 it throws error.

iconv -f binary -t utf-8 fiename.csv
iconv: conversion from binary' is not supported
Try iconv --help' or iconv --usage' for more information.

can anyone please help me to understand is it possible to convert or not, I can able to see the data using head command.

What does it mean , Binary means non-readable but how head command or notepad can read the data.

od -tc < filename.csv | head

0000000 357 273 277   |   |   R   e   q   u   e   s   t   _   I   D   #
0000020   D   #   T   y   p   e   #   D   #   S   u   b   m   i   t   t
0000040   e   r   #   D   #   S   h   o   r   t   _   D   e   s   c   r
0000060   i   p   t   i   o   n   #   D   #   L   o   g   _   T   e   x
0000100   t   #   D   #   S   t   a   t   u   s   #   D   #   A   s   s
0000120   i   g   n   e   d   _   T   o   #   D   #   A   s   s   i   g
0000140   n   e   e   #   D   #   C   r   e   a   t   e   _   D   a   t
0000160   e   #   D   #   F   o   r   w   T   o   E   x   t   H   D   #
0000200   D   #   L   a   s   t   _   M   o   d   i   f   i   e   d   _
0000220   B   y   #   D   #   L   o   g   _   I   D   #   D   #   L   o

Best Answer

"binary" isn't an encoding (character-set name). iconv needs an encoding name to do its job.

The file utility doesn't give useful information when it doesn't recognize the file format. It could be UTF-16 for example, without a byte-encoding-mark (BOM). notepad reads that. The same applies to UTF-8 (and head would display that since your terminal may be set to UTF-8 encoding, and it would not care about a BOM).

If the file is UTF-16, your terminal would display that using head because most of the characters would be ASCII (or even Latin-1), making the "other" byte of the UTF-16 characters a null.

In either case, the lack of BOM will (depending on the version of file) confuse it. But other programs may work, because these file formats can be used with Microsoft Windows as well as portable applications that may run on Windows.

To convert the file to UTF-8, you have to know which encoding it uses, and what the name for that encoding is with iconv. If it is already UTF-8, then whether you add a BOM (at the beginning) is optional. UTF-16 has two flavors, according to which byte is first. Or you could even have UTF-32. iconv -l lists these:

ISO-10646/UTF-8/
ISO-10646/UTF8/
UTF-7//
UTF-8//
UTF-16//
UTF-16BE//
UTF-16LE//
UTF-32//
UTF-32BE//
UTF-32LE//
UTF7//
UTF8//
UTF16//
UTF16BE//
UTF16LE//
UTF32//
UTF32BE//
UTF32LE//

"LE" and "BE" refer to little-end and big-end for the byte-order. Windows uses the "LE" flavors, and iconv likely assumes that for the flavors lacking "LE" or "BE".

You can see this using an octal (sic) dump:

$ od -bc big-end
0000000 000 124 000 150 000 165 000 040 000 101 000 165 000 147 000 040
         \0   T  \0   h  \0   u  \0      \0   A  \0   u  \0   g  \0    
0000020 000 061 000 070 000 040 000 060 000 065 000 072 000 060 000 061
         \0   1  \0   8  \0      \0   0  \0   5  \0   :  \0   0  \0   1
0000040 000 072 000 065 000 067 000 040 000 105 000 104 000 124 000 040
         \0   :  \0   5  \0   7  \0      \0   E  \0   D  \0   T  \0    
0000060 000 062 000 060 000 061 000 066 000 012
         \0   2  \0   0  \0   1  \0   6  \0  \n
0000072

$ od -bc little-end
0000000 124 000 150 000 165 000 040 000 101 000 165 000 147 000 040 000
          T  \0   h  \0   u  \0      \0   A  \0   u  \0   g  \0      \0
0000020 061 000 070 000 040 000 060 000 065 000 072 000 060 000 061 000
          1  \0   8  \0      \0   0  \0   5  \0   :  \0   0  \0   1  \0
0000040 072 000 065 000 067 000 040 000 105 000 104 000 124 000 040 000
          :  \0   5  \0   7  \0      \0   E  \0   D  \0   T  \0      \0
0000060 062 000 060 000 061 000 066 000 012 000
          2  \0   0  \0   1  \0   6  \0  \n  \0
0000072

Assuming UTF-16LE, you could convert using

iconv -f UTF-16LE// -t UTF-8// <input >output

Related Solutions

How to test the encoding of a text file… Is it valid, and what is it

The file command makes "best-guesses" about the encoding. Use the -i parameter to force file to print information about the encoding.

Demonstration:

$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-utf16.txt:    text/plain; charset=utf-16le
umlaut-utf8.txt:     text/plain; charset=utf-8

Here is how I created the files:

$ echo ä > umlaut-utf8.txt

Nowadays everything is utf-8. But convince yourself:

$ hexdump -C umlaut-utf8.txt 
00000000  c3 a4 0a                                          |...|
00000003

Compare with https://en.wikipedia.org/wiki/Ä#Computer_encoding

Convert to the other encodings:

$ iconv -f utf8 -t iso88591 umlaut-utf8.txt > umlaut-iso88591.txt 
$ iconv -f utf8 -t utf16 umlaut-utf8.txt > umlaut-utf16.txt

Check the hex dump:

$ hexdump -C umlaut-iso88591.txt 
00000000  e4 0a                                             |..|
00000002
$ hexdump -C umlaut-utf16.txt 
00000000  ff fe e4 00 0a 00                                 |......|
00000006

Create something "invalid" by mixing all three:

$ cat umlaut-iso88591.txt umlaut-utf8.txt umlaut-utf16.txt > umlaut-mixed.txt

What file says:

$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-mixed.txt:    application/octet-stream; charset=binary
umlaut-utf16.txt:    text/plain; charset=utf-16le
umlaut-utf8.txt:     text/plain; charset=utf-8

without -i:

$ file *
umlaut-iso88591.txt: ISO-8859 text
umlaut-mixed.txt:    data
umlaut-utf16.txt:    Little-endian UTF-16 Unicode text, with no line terminators
umlaut-utf8.txt:     UTF-8 Unicode text

The file command has no idea of "valid" or "invalid". It just sees some bytes and tries to guess what the encoding might be. As humans we might be able to recognize that a file is a text file with some umlauts in a "wrong" encoding. But as a computer it would need some sort of artificial intelligence.

One might argue that the heuristics of file is some sort of artificial intelligence. Yet, even if it is, it is a very limited one.

Here is more information about the file command: http://www.linfo.org/file_command.html

How to set VIM’s default encoding to UTF-8

When Vim reads an existing file, it tries to detect the file encoding. When writing out the file, Vim uses the file encoding that it detected (except when you tell it differently). So a file detected as UTF-8 is written as UTF-8, a file detected as Latin-1 is written as Latin-1, and so on.

By default, the detection process is crude. Every file that you open with Vim will be assumed to be Latin-1, unless it detects a Unicode byte-order mark at the top. A UTF-8 file without a byte-order mark will be hard to edit because any multibyte characters will be shown in the buffer as character sequences instead of single characters.

Worse, Vim, by default, uses Latin-1 to represent the text in the buffer. So a UTF-8 file with a byte-order mark will be corrupted by down-conversion to Latin-1.

The solution is to configure Vim to use UTF-8 internally. This is, in fact, recommended in the Vim documentation, and the only reason it is not configured that way out of the box is to avoid creating enormous confusion among users who expect Vim to operate basically as a Latin-1 editor.

In your .vimrc, add set encoding=utf-8 and restart Vim.

Or instead, set the LANG environment variable to indicate that UTF-8 is your preferred character encoding. This will affect not just Vim but any software which relies on LANG to determine how it should represent text. For example, to indicate that text should appear in English (en), as spoken in the United States (US), encoded as UTF-8 (utf-8), set LANG=en_US.utf-8.

Now Vim will use UTF-8 to represent the text in the buffer. Plus, it will also make a more determined effort to detect the UTF-8 encoding in a file. Besides looking for a byte-order mark, it will also check for UTF-8 without a byte-order mark before falling back to Latin-1. So it will no longer corrupt a file coded in UTF-8, and it should properly display the UTF-8 characters during the editing session.

For more information on how Vim detects the file encoding, see the fileencodings option in the Vim documentation.

For more information on setting the encoding that Vim uses internally, see the encoding option.

If you need to override the encoding used when writing a file back to disk, see the fileencoding option.

Best Answer

Related Solutions

How to test the encoding of a text file… Is it valid, and what is it

How to set VIM’s default encoding to UTF-8

Related Question