Trying to use grep today, I ran into the familiar problem of the Byte Order Mark (BOM) in a Unicode file (UTF-8, in this case). Specifically, I was trying to find a file beginning with XYZ using the pattern grep '^XYZ', but of course grep treated the BOM's three bytes as ordinary characters and did not match the first line of the file, even though that line started with XYZ. I even tried updating the regular expression to skip leading whitespace ('^[[:space:]]*XYZ'), but to no avail.
Other questions have dealt with converting files or targeting the BOM specifically, but I want to know if POSIX tools have a general option to handle Unicode files correctly. If grep handled the Unicode file correctly, it would consider the file contents to start after the BOM and match XYZ on the first line just like any other line.
Best Answer
The Unicode Consortium has an FAQ that includes an entry, "How should I deal with BOMs?".
Note that UTF-8 is always of known endianness, because it has no endianness. So as long as you know the text is UTF-8, "the BOM should not be used."
Even cat will return incorrect results when a BOM is used unnecessarily, as the BOMs of all files but the first will be treated as zero-width non-breaking spaces. But the power of UNIX lies in filters.
For operations on a single file or stream, sed "1s/^$(printf '\357\273\277')//" in a pipeline will strip a BOM if present, leaving all other streams intact.
For operations with multiple files, a shell with process substitution (like Bash, but unfortunately not POSIX shell) is useful:
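A minimal sketch of that approach, assuming Bash and two throwaway sample files. The strip_bom helper name and the a.txt/b.txt file names are illustrative assumptions; only the sed expression itself comes from the text above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (name is an assumption): remove a leading UTF-8 BOM
# (bytes EF BB BF) from the first line of standard input.
strip_bom() {
    sed "1s/^$(printf '\357\273\277')//"
}

# Two placeholder sample files; only a.txt starts with a BOM.
tmp=$(mktemp -d)
printf '\357\273\277XYZ with BOM\n' > "$tmp/a.txt"
printf 'XYZ without BOM\n'          > "$tmp/b.txt"

# Unfiltered: the BOM sits between the start of the line and XYZ in a.txt.
raw=$(cat "$tmp/a.txt" "$tmp/b.txt" | grep -c '^XYZ')

# Filtered: process substitution gives each file its own BOM-stripping sed.
clean=$(cat <(strip_bom < "$tmp/a.txt") <(strip_bom < "$tmp/b.txt") | grep -c '^XYZ')

echo "raw=$raw clean=$clean"
rm -rf "$tmp"
```

On a typical system this should report one matching line before filtering and two after, since only the filtered stream lets ^XYZ anchor at the true start of a.txt. Each file needs its own sed because the 1s address only applies to the first line of a given input stream.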