Iconv generating UTF-16 with BOM

unicode

Inspired by this question, can I use the iconv command to generate UTF-16 output with a BOM and with specified endianness?

The iconv command converts text from one encoding to another.

For example:

echo hello | iconv -f ascii -t utf-16

generates a UTF-16 representation of "hello\n".

UTF-16 files often, but not always, start with a Byte Order Mark (BOM), which is a 2-byte encoding of the Unicode character U+FEFF. You can determine the endianness of a UTF-16 file with BOM by checking whether the first two bytes are FE FF or FF FE.
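The BOM check described above can be sketched in a few lines of shell. The helper name `utf16_bom` is hypothetical, and the sketch assumes a `head` that accepts `-c` (common, but not strictly POSIX):

```shell
# Hypothetical helper: report the byte order implied by a UTF-16 BOM on stdin.
utf16_bom() {
    # Read the first two bytes and print them as lowercase hex, e.g. "fffe".
    bom=$(head -c 2 | od -An -tx1 | tr -d ' \t\n')
    case "$bom" in
        feff) echo "big-endian" ;;
        fffe) echo "little-endian" ;;
        *)    echo "no BOM" ;;
    esac
}

printf '\376\377\000h' | utf16_bom   # starts with FE FF
printf '\377\376h\000' | utf16_bom   # starts with FF FE
printf 'h\000'         | utf16_bom   # no BOM at all
```

The three sample inputs are hand-built byte sequences, so the check does not depend on what any particular iconv produces.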

The iconv command has several options for generating UTF-16 output:

$ iconv --list | grep -i utf-16
UTF-16//
UTF-16BE//
UTF-16LE//

This command:

echo hello | iconv -f ascii -t utf-16be

generates big-endian UTF-16 with no BOM; it seems to assume that if you specified the endianness, you don't need to indicate it in the output. Similarly, utf-16le generates little-endian UTF-16 with no BOM.
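One way to confirm the missing BOM is to dump the output one byte at a time. This is a minimal sketch; `hexbytes` is a hypothetical helper, and it uses `od -tx1` rather than `od -x` because `-x` prints 16-bit words in the host's own byte order, which can obscure the result:

```shell
# Hypothetical helper: print stdin as a continuous string of hex bytes.
hexbytes() { od -An -tx1 | tr -d ' \t\n'; echo; }

# With an explicit byte order, the output should begin directly with "h"
# (00 68 or 68 00) rather than with a BOM (fe ff or ff fe).
echo hello | iconv -f ascii -t utf-16be | hexbytes
echo hello | iconv -f ascii -t utf-16le | hexbytes
```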

This:

echo hello | iconv -f ascii -t utf-16

generates (on my x86 Ubuntu system) little-endian UTF-16 with a BOM, but I've seen a report of a similar command generating big-endian UTF-16 with a BOM, even on a little-endian system.

I can always use utf-16be or utf-16le and prepend the BOM manually, but I'm looking for a solution that just uses the iconv command.
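For reference, the manual workaround might look like the sketch below. The function names are hypothetical, and the BOM is written with octal escapes because `\x` escapes in `printf` are not portable across shells:

```shell
# Hypothetical helpers: ASCII on stdin -> UTF-16 with an explicit BOM on stdout.
to_utf16le_bom() {
    printf '\377\376'            # U+FEFF as little-endian bytes FF FE
    iconv -f ascii -t utf-16le   # payload; iconv itself writes no BOM here
}

to_utf16be_bom() {
    printf '\376\377'            # U+FEFF as big-endian bytes FE FF
    iconv -f ascii -t utf-16be
}

# Byte-wise dump of the little-endian result.
echo hello | to_utf16le_bom | od -An -tx1
```

The dump should start with `ff fe` followed by `68 00`, that is, the BOM and then "h" in little-endian order.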

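Another workaround, if you know which endianness -t utf-16 generates and you want the opposite one, is to swap each pair of bytes (which flips the BOM along with the text):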

echo hello | iconv -f ascii -t utf-16 | dd conv=swab 2>/dev/null
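To see what the byte swap does without depending on which byte order iconv picks, here is a minimal sketch that feeds dd a hand-built little-endian sample. The sample bytes are an illustration only, not the output of any particular iconv:

```shell
# Hand-built sample: BOM FF FE, then "h" as little-endian UTF-16 (68 00).
# conv=swab swaps each pair of bytes, so the BOM is flipped together with the text.
printf '\377\376h\000' | dd conv=swab 2>/dev/null | od -An -tx1
```

The dump should show `fe ff 00 68`, which is the same text in big-endian form with a big-endian BOM.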

What I'd like to use is something like:

iconv -f ascii -t utf-16bebom # big-endian with BOM
iconv -f ascii -t utf-16lebom # little-endian with BOM

but iconv doesn't support that.

EDIT:

Can someone with access to an x86 Mac OS X system post a comment showing the (copy-and-pasted) output of the following command?

echo hello | iconv -f ascii -t utf-16 | od -x

Best Answer

No, if you specify the byte ordering, iconv does not insert a BOM.

This is from the Unicode Consortium's FAQ:

Q: How I should deal with BOMs?

A: Here are some guidelines to follow:

  1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
  2. Some protocols allow optional BOMs in the case of untagged text. In those cases,
    • Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
    • Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
  3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.
  4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.

(my emphasis)

I expect iconv is attempting to be faithful to the last of these guidelines.


Update.

A digression

In my opinion:

  1. An option to specify a BOM would certainly be a useful additional feature for iconv.

  2. A UTF-16LE file without a BOM is usable in Windows, albeit sometimes with additional effort. For example, Notepad's File Open dialogue allows you to select "Unicode", which is Microsoft's name for "UTF-16LE" and (unsurprisingly) seems to work on files without a BOM.

  3. I can open a UTF-16LE test file (without BOM) or a UTF-8 test file (without BOM) in Windows Notepad (XP) in the usual way, e.g. by double-clicking the file's name in Explorer. That seems usable to me. I am aware that Windows will sometimes guess the encoding incorrectly, in which case you have to tell Notepad the encoding when opening the file. This inconvenience means including a BOM is preferable for text files intended for use on Windows.

  4. If a specific application will not work with anything other than a UTF-16LE file with BOM, then I would agree that a UTF-16LE file without BOM is not usable for that specific application.

  5. I suspect that if you can make everything work with UTF-8 (without BOM), that is the best solution in the long term.

However, the answer to the question "can I use the iconv command to generate UTF-16 output with a BOM and with specified endianness" is currently "No".
