Get consistent encoding for all files in directory

character-encoding, files

I have a directory containing many CSV files from various vendors, in two different encodings:

  • ASCII Text / UTF-8
  • UCS2 / UTF-16 little endian

I'd like to use grep, awk, sed and other utilities on these datafiles using conventional syntax.

Re-encoding these files from UTF-16 to UTF-8 loses no useful data: the CSV files contain only ASCII data, so it's beyond me why some vendors supply them as little-endian UTF-16 some of the time.

I've written a short script (below) that parses the output of file, but I suspect it's quite fragile.

There must be a better way of managing files with multiple encodings. Are there any programs or utilities that can help with this sort of problem?

I'm using Debian Stable.

# Convert any UTF-16 CSV in the current directory to UTF-8,
# detecting UTF-16 by substring-matching file's free-form description.
for f in ./*.csv
do
  if [[ $(file "$f") == *"UTF-16"* ]]
  then
    iconv -f UTF-16 -t UTF-8 "$f" > "$f"-new
    mv "$f"-new "$f"
  fi
done
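
The fragile part is that the description file prints is free-form and can vary between versions and with the exact file contents. Roughly like this (data.csv is a hypothetical file here, and the wording may differ on your system):

$ file data.csv
data.csv: Little-endian UTF-16 Unicode text, with CRLF line terminators
$ file ascii.csv
ascii.csv: ASCII text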

Best Answer

I'd refine your script to:

# noclobber makes the > redirection fail instead of silently
# overwriting an existing "$f"-new left over from a previous run.
set -o noclobber
for f in ./*.csv
do
  # -b omits the file name; --mime-encoding prints a single stable
  # token such as utf-16le rather than a free-form description.
  if [ "$(file -b --mime-encoding "$f")" = utf-16le ]; then
    # Only replace the original if iconv succeeded.
    iconv -f UTF-16 -t UTF-8 "$f" > "$f"-new &&
      mv "$f"-new "$f"
  fi
done
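
As a sanity check afterwards, something like this (using GNU file, as shipped on Debian) should report only us-ascii and/or utf-8 once every file has been converted:

file -b --mime-encoding ./*.csv | sort | uniq -c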