Get consistent encoding for all files in directory

character-encoding, files

I have a directory containing many CSV files from various vendors, in two different encodings:

  • ASCII Text / UTF-8
  • UCS2 / UTF-16 little endian

I'd like to use grep, awk, sed and other utilities on these datafiles using conventional syntax.

Re-encoding these files from UTF-16 to UTF-8 loses no useful data: the CSV files contain only ASCII data, so it's beyond me why some vendors supply them as little-endian UTF-16 some of the time.

I've written a short script (below) that parses the output of file, but I suspect it's quite fragile.

There must be a better way of managing files with multiple encodings. Are there any programs or utilities that can help with this sort of problem?

I'm using Debian Stable.

# Convert any UTF-16 CSV in the current directory to UTF-8,
# detecting UTF-16 by substring-matching file's free-form description.
for f in ./*.csv
do
  if [[ $(file "$f") == *"UTF-16"* ]]
  then
    iconv -f UTF-16 -t UTF-8 "$f" > "$f"-new
    mv "$f"-new "$f"
  fi
done
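
The fragile part is that the description file prints is free-form and can vary between versions and with the exact file contents. Roughly like this (data.csv is a hypothetical file here, and the wording may differ on your system):

$ file data.csv
data.csv: Little-endian UTF-16 Unicode text, with CRLF line terminators
$ file ascii.csv
ascii.csv: ASCII text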

Best Answer

I'd refine your script to:

# noclobber makes the > redirection fail instead of silently
# overwriting an existing "$f"-new left over from a previous run.
set -o noclobber
for f in ./*.csv
do
  # -b omits the file name; --mime-encoding prints a single stable
  # token such as utf-16le rather than a free-form description.
  if [ "$(file -b --mime-encoding "$f")" = utf-16le ]; then
    # Only replace the original if iconv succeeded.
    iconv -f UTF-16 -t UTF-8 "$f" > "$f"-new &&
      mv "$f"-new "$f"
  fi
done
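
As a sanity check afterwards, something like this (using GNU file, as shipped on Debian) should report only us-ascii and/or utf-8 once every file has been converted:

file -b --mime-encoding ./*.csv | sort | uniq -c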