Grep Character Encoding – How to Use Grep/Ack with Files in Arbitrary Encoding?

character encoding, grep, locale

On my Linux desktop I have a UTF-8 locale. When I try to search some KOI8-R encoded files with grep (ack), it fails. If I manually encode the pattern into KOI8-R and pass that as an argument, it works.

Is it possible to tell grep what encoding to use for the pattern? Or any other tool?

Best Answer

If all the files you're searching in have the same encoding:

LC_CTYPE=ru_RU.KOI8-R luit ack-grep "$(echo 'привет' | iconv -t KOI8-R)" *.txt

or, in a shell that supports here-strings, such as bash or zsh:

LC_CTYPE=ru_RU.KOI8-R luit ack-grep "$(iconv -t KOI8-R <<<'привет')" *.txt

Or start a child shell in the desired encoding:

$ LC_CTYPE=ru_RU.KOI8-R luit
$ ack-grep 'привет' *.txt
$ exit

Luit (shipped with XFree86 and X.org) runs the program given on its command line, or a shell if none is given, in the locale selected by LC_CTYPE, and assumes that the terminal itself uses UTF-8. The command therefore runs in the desired locale, and Luit translates its terminal output to UTF-8.
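If Luit is not installed, the pattern and output conversions can be done by hand with iconv alone. This is a minimal sketch, not a tested replacement for the commands above. It assumes a grep that compares raw bytes in the C locale (recent GNU grep does) and ASCII-only file names, since grep's whole output is passed back through iconv. The name grep_in_enc is a hypothetical helper, not a standard tool.

```shell
# grep_in_enc ENCODING PATTERN FILE...
# Hypothetical helper: search files stored in ENCODING for a PATTERN typed in UTF-8.
grep_in_enc() {
    enc=$1
    pattern=$2
    shift 2
    # Re-encode the UTF-8 pattern into the encoding used by the files.
    raw_pattern=$(printf '%s' "$pattern" | iconv -f UTF-8 -t "$enc") || return
    # LC_ALL=C makes grep compare bytes without decoding them; the matching
    # lines are then converted back to UTF-8 for display on the terminal.
    LC_ALL=C grep -e "$raw_pattern" -- "$@" | iconv -f "$enc" -t UTF-8
}

# Usage: grep_in_enc KOI8-R 'привет' *.txt
```

Because the match is byte-wise, options that depend on the locale knowing the characters, such as case-insensitive matching of Cyrillic letters, will probably not behave as expected with this sketch. Running grep under the real KOI8-R locale, as above, avoids that limitation.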

Another approach, if you have a directory tree with a lot of files in a different encoding, is to mount a view of that directory tree in your preferred encoding. I think the fuseflt filesystem can do this (untested).

mkdir /utf8-view
fuseflt iconv-koi8r-utf8.conf /some/dir /utf8-view
ack-grep 'привет' /utf8-view/*.txt.utf8
fusermount -u /utf8-view

where the configuration file iconv-koi8r-utf8.conf contains

ext_in =
ext_out = *.utf8
flt_in =
flt_out = .utf8
flt_cmd = iconv -f KOI8-R -t UTF-8
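Where fuseflt is not available, the same iconv filter can be applied file by file in a plain loop, at the cost of converting every file again on each search. This is a sketch under that assumption. The name grep_as_utf8 is a hypothetical helper, and a UTF-8 locale is assumed so that grep treats the converted text as characters.

```shell
# grep_as_utf8 ENCODING PATTERN FILE...
# Hypothetical helper: convert each file to UTF-8 on the fly, then search it.
grep_as_utf8() {
    enc=$1
    pattern=$2
    shift 2
    for f in "$@"; do
        # grep only sees standard input here, so the file name is added
        # to each matching line by the loop that follows it.
        iconv -f "$enc" -t UTF-8 < "$f" |
            grep -e "$pattern" |
            while IFS= read -r line; do
                printf '%s:%s\n' "$f" "$line"
            done
    done
}

# Usage: grep_as_utf8 KOI8-R 'привет' /some/dir/*.txt
```

Since the pattern and the converted text are both UTF-8 here, the pattern needs no re-encoding and the output needs no translation back.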