Inspired by this question, can I use the iconv command to generate UTF-16 output with a BOM and with specified endianness?
The iconv command converts text from one encoding to another.
For example:
echo hello | iconv -f ascii -t utf-16
generates a UTF-16 representation of "hello\n".
UTF-16 files often, but not always, start with a Byte Order Mark (BOM), which is a 2-byte encoding of the Unicode character U+FEFF. You can determine the endianness of a UTF-16 file with BOM by checking whether the first two bytes are FE FF (big-endian) or FF FE (little-endian).
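As a rough sketch of that check, the function below reads the first two bytes of a file and reports which BOM, if any, it finds. The name utf16_bom_kind is made up for illustration, and the sketch assumes the file really is UTF-16 (a UTF-32LE file also starts with FF FE and would be misreported).

```shell
# utf16_bom_kind FILE
# Hypothetical helper (not part of iconv): report the byte order implied
# by a leading UTF-16 BOM. Assumes FILE is UTF-16 text.
utf16_bom_kind() {
    # -N2 reads only the first two bytes; -tx1 prints them one byte at a
    # time, so the result does not depend on the host's own endianness.
    bom=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    case $bom in
        feff) echo "big-endian" ;;
        fffe) echo "little-endian" ;;
        *)    echo "no BOM" ;;
    esac
}
```

Using od -tx1 rather than od -x matters here: od -x prints 16-bit words in the host's byte order, so on a little-endian machine the bytes FF FE are displayed as feff.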
The iconv command has several options for generating UTF-16 output:
$ iconv --list | grep -i utf-16
UTF-16//
UTF-16BE//
UTF-16LE//
This command:
echo hello | iconv -f ascii -t utf-16be
generates big-endian UTF-16 with no BOM; it seems to assume that if you specified the endianness, you don't need to indicate it in the output. Similarly, utf-16le generates little-endian UTF-16 with no BOM.
This:
echo hello | iconv -f ascii -t utf-16
generates (on my x86 Ubuntu system) little-endian UTF-16 with a BOM, but I've seen a report of a similar command generating big-endian UTF-16 with a BOM, even on a little-endian system.
I can always use utf-16be or utf-16le and prepend the BOM manually, but I'm looking for a solution that just uses the iconv command.
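For reference, the manual approach can be sketched as below. The function names to_utf16le_bom and to_utf16be_bom are made up for illustration; they are shell wrappers, not iconv options, and the sketch assumes ASCII input arriving on standard input.

```shell
# Hypothetical wrappers (not iconv features): write the BOM by hand, then
# let iconv emit the BOM-less payload in the requested byte order.
# Octal escapes are used because every POSIX printf understands them.
to_utf16le_bom() {
    printf '\377\376'             # U+FEFF in little-endian order: FF FE
    iconv -f ascii -t utf-16le    # reads stdin, writes UTF-16LE without a BOM
}

to_utf16be_bom() {
    printf '\376\377'             # U+FEFF in big-endian order: FE FF
    iconv -f ascii -t utf-16be    # reads stdin, writes UTF-16BE without a BOM
}
```

Usage would look like: echo hello | to_utf16le_bom | od -An -tx1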
Another workaround, if you know what endianness -t utf-16 generates and want the opposite one, is to swap every pair of bytes, which flips both the BOM and the text:
echo hello | iconv -f ascii -t utf-16 | dd conv=swab 2>/dev/null
What I'd like to use is something like:
iconv -f ascii -t utf-16bebom # big-endian with BOM
iconv -f ascii -t utf-16lebom # little-endian with BOM
but iconv doesn't support that.
EDIT:
Can someone with access to an x86 Mac OS X system post a comment showing the (copy-and-pasted) output of the following command?
echo hello | iconv -f ascii -t utf-16 | od -x
Best Answer
No, if you specify the byte ordering, iconv does not insert a BOM.
This is from The Unicode Consortium (my emphasis)
I expect iconv is attempting to be faithful to the last of these guidelines.
Update.
A digression
In my opinion:
An option to specify a BOM would certainly be a useful additional feature for iconv.
A UTF-16LE file without a BOM is usable in Windows, albeit sometimes with additional effort. For example, Notepad's File Open dialogue allows you to select "Unicode", which is Microsoft's name for "UTF-16LE" and (unsurprisingly) seems to work on files without a BOM.
I can open a UTF-16LE test file (without BOM) or a UTF-8 test file (without BOM) in Windows Notepad (XP) in the usual way, e.g. by double-clicking the file's name in Explorer. That seems usable to me. I am aware that Windows will sometimes guess the encoding incorrectly, in which case you have to tell Notepad the encoding when opening the file. This inconvenience means including a BOM is preferable for text files intended for use on Windows.
If a specific application will not work with anything other than a UTF-16LE file with BOM, then I would agree that a UTF-16LE file without BOM is not usable for that specific application.
I suspect that if you can make everything work with UTF-8 (without BOM), that is the best solution in the long term.
However, the answer to the question "can I use the iconv command to generate UTF-16 output with a BOM and with specified endianness" is currently "No".