AWK with BOM: Is there any cool way to handle Unicode BOM with regexp

awk, regular expression, unicode

I have two files encoded in UTF-8 with/without BOM:

/tmp/bom$ ls
list.bom.txt  list.nobom.txt
/tmp/bom$ cat list.nobom.txt 
apple
banana
avocado
寿司
melon
/tmp/bom$ diff list.nobom.txt list.bom.txt 
1c1
< apple
---
> apple
/tmp/bom$ file list.nobom.txt list.bom.txt 
list.nobom.txt: UTF-8 Unicode text
list.bom.txt:   UTF-8 Unicode (with BOM) text

The only difference between the two files is the leading BOM (bytes EF BB BF).

Then, to filter the lines that begin with 'a', I wrote a short awk script using a caret.

/tmp/bom$ gawk '/^a.*/' list.nobom.txt
apple
avocado
/tmp/bom$ gawk '/^a.*/' list.bom.txt
avocado

Unfortunately, with a leading BOM, apple on the first line is not matched.

Therefore, my question is: Is there any way to handle this?

I have considered three solutions:

  1. Write the BOM bytes directly. For example,

    gawk 'BEGIN { pat = "^(\xef\xbb\xbf)?a.*" } $0 ~ pat { print }'
    

    works for UTF-8. However, this doesn't handle other encodings. Moreover, if U+FEFF is used as a Zero Width No-Break Space (see comments), the above script fails in some cases.

  2. Delete the BOM bytes by re-encoding with nkf. For example,

    nkf --oc=UTF-8 list.bom.txt | gawk '/^a.*/'
    

    works. However, I wonder if there is a more sophisticated way.

  3. [ADDED] This is an improvement on the first one, using a bash feature.

    gawk -v bom="$(echo -e '\uFEFF')" '
        NR == 1 {
            pat = "^" bom;
            sub(pat, "")
        }
        /^a.*/ {
            print
        }
    '
    

    This works for UTF-8 both with and without a BOM. However, it doesn't work for UTF-16 in my environment, so the second solution is better.
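For reference, the BOM-stripping idea from the first and third solutions can be sketched without the bash-specific `echo -e`, by writing the three UTF-8 BOM bytes as octal escapes inside the awk regexp. The function name `filter_a` is made up for illustration, and this assumes UTF-8 input only; it does nothing for UTF-16.

```shell
# Hypothetical helper: print lines starting with 'a',
# ignoring a UTF-8 BOM at the very start of the input.
filter_a() {
    awk '
        # Only the first record can carry a leading BOM (EF BB BF).
        NR == 1 { sub(/^\357\273\277/, "") }
        # The caret now anchors at the real start of the line.
        /^a/    { print }
    ' "$@"
}
```

It could be called as `filter_a list.bom.txt`, or at the end of a pipeline with no arguments. Because the substitution is limited to `NR == 1`, a U+FEFF appearing later in the file is left alone.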

Moreover, I think this is also a problem for grep, sed, and other tools that use regular expression matching.
So a general solution would be even more appreciated.

Best Answer

A BOM doesn't make sense in UTF-8, which has no byte order to mark. BOMs are generally added by mistake by bogus software on Microsoft OSes.

dos2unix will remove it and also take care of other idiosyncrasies of Windows text files.

dos2unix < file.win.txt | awk ...
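If dos2unix is not available, the same "strip it before the regexp tool sees it" approach can be sketched with sed as a small filter placed in front of awk, grep, or another sed. The function name `strip_bom` is an assumption for illustration, and the sketch handles only a UTF-8 BOM; UTF-16 input would still need re-encoding first (for example with nkf, as in the question).

```shell
# Hypothetical filter: remove a UTF-8 BOM from the first line only.
strip_bom() {
    # Build the three BOM bytes (EF BB BF) with POSIX printf octal escapes,
    # so the sed script itself needs no non-standard \x escapes.
    bom=$(printf '\357\273\277')
    sed "1s/^$bom//" "$@"
}
```

Usage would look like `strip_bom list.bom.txt | grep '^a'` or `strip_bom < list.bom.txt | awk '/^a/'`. Unlike dos2unix, this leaves CR LF line endings untouched.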