Ubuntu – gedit can’t recognize character encoding, but gvim can


I have a lot of plain text files which come from a Windows environment.
Many of them use a whacky default Windows code-page, which is neither ASCII (7 bits) nor UTF-8.

gvim has no problem opening these files, but gedit fails to do so.
gvim reports the encoding as latin1.

I assume that gvim is making a "smart" assumption about the code-page.
(I believe this code-page still has international variants).

Some questions arise from this:

  • (1). Is there some way gedit can be told to recognize this code-page?
    ** NB. [Update] For this point (1), see my answer, below.
    ** For points (2) and (3), see Oli's answer.

  • (2). Is there a way to scan the file system to identify these problem files?

  • (3). Is there a batch converting tool to convert these files to UTF-8?

(.. this old-world text mayhem was actually the final straw that brought me over to Ubuntu… UTF-8 system-wide by default. Brilliant!)

** NB: ** I now consider the following Update to be partially irrelevant, because the "problem" files aren't the "problem" (see my answer below).
I've left it here, because it may be of some general use to someone.

I've worked out a rough and ready way to identify the problem files…
The file command was not suitable, because it identified my example file as ASCII… but an ASCII file is 100% UTF-8 compliant…
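For what it's worth, file can be asked for the MIME encoding specifically, which does distinguish pure ASCII from files containing high bytes (a hedged aside: the detection heuristics vary between versions of file, so your output may differ):

```shell
# Two hypothetical sample files: one pure ASCII, one containing the
# latin1 byte 0xF8 ("ø"), which is not valid UTF-8 on its own.
printf 'plain ascii\n' > ok.txt
printf 'G\xf8te-borg\n' > bad.txt

file --mime-encoding ok.txt bad.txt
# ok.txt:  us-ascii
# bad.txt: iso-8859-1   (on my version of file; yours may report differently)
```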

As I mentioned in a comment below, the test for an invalid first byte of a UTF-8 codepoint is:

  • if the first byte (of a UTF-8 codepoint) is between 0x80 and 0xBF (reserved for continuation bytes), or greater than 0xF7 (0xF8–0xFF would begin a sequence longer than four bytes, which UTF-8 forbids), that is considered an error

I know sed (a bit, via a Win32 port), so I've managed to cobble together a RegEx pattern which finds these offending bytes.

It's an ugly line, so look away now if regular expressions scare you 🙂

I'd really appreciate it if someone points out how to use hex values in a range [] expression.. I've just used the or operator \|

sed -n "/\x80\|\x81\|\x82\|\x83\|\x84\|\x85\|\x86\|\x87\|\x88\|\x89\|\x8A\|\x8B\|\x8C\|\x8D\|\x8E\|\x8F\|\x90\|\x91\|\x92\|\x93\|\x94\|\x95\|\x96\|\x97\|\x98\|\x99\|\x9A\|\x9B\|\x9C\|\x9D\|\x9E\|\x9F\|\xA0\|\xA1\|\xA2\|\xA3\|\xA4\|\xA5\|\xA6\|\xA7\|\xA8\|\xA9\|\xAA\|\xAB\|\xAC\|\xAD\|\xAE\|\xAF\|\xB0\|\xB1\|\xB2\|\xB3\|\xB4\|\xB5\|\xB6\|\xB7\|\xB8\|\xB9\|\xBA\|\xBB\|\xBC\|\xBD\|\xBE\|\xBF\|\xF8\|\xF9\|\xFA\|\xFB\|\xFC\|\xFD\|\xFE\|\xFF/p" "${fqfn}"  
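On the range question: one way (a sketch, assuming bash and GNU sed) is to let the shell's $'…' quoting expand the hex escapes into literal bytes before sed ever sees them, and to set the C locale so that [ ] ranges compare raw byte values:

```shell
# Same test as the long alternation above, but with two byte ranges.
# bash's $'...' expands \x80 etc. to literal bytes; LC_ALL=C makes
# bracket-expression ranges byte-valued rather than collation-based.
# "${fqfn}" is the file under test, as in the sed line above.
LC_ALL=C sed -n $'/[\x80-\xBF\xF8-\xFF]/p' "${fqfn}"
```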

So, I'll now graft this into Oli's batch solution… Thanks Oli!

PS. Here is the invalid UTF-8 byte it found in my sample file …
"H.Bork, Gøte-borg." … the "ø" = F8 hex… a perfectly good latin1 byte, but invalid as a UTF-8 byte.
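As a sanity check (assuming iconv and od are available), converting that byte from latin1 produces the proper two-byte UTF-8 sequence:

```shell
# In latin1, "ø" is the single byte 0xF8; in UTF-8 the same character
# is encoded as the two bytes 0xC3 0xB8.
printf 'G\xf8te-borg\n' | iconv -f LATIN1 -t UTF-8 | od -An -tx1
# prints: 47 c3 b8 74 65 2d 62 6f 72 67 0a
```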

Best Answer

iconv is probably what you'll want to use. iconv -l will show you the available encodings and then you can use a couple of commands to recode them all:

# all text files are in ./originals/
# new files will be written to ./newversions/

mkdir -p newversions
cd originals
for file in *.txt; do
    iconv -f ASCII -t UTF-8 "$file" > "../newversions/$file"
done

If you want to do this with files you don't know the encoding of (because they're all over the place), you want to bring in a few more commands: find, file, awk and sed. The last two are just there to process the output of file.

for file in $(find . -type f -exec file --mime {} \; | grep "ascii" | awk '{print $1}' | sed 's/.$//'); do

I've no idea if this actually works so I certainly wouldn't run it from anything but the least important directory you have (make a testing folder with some known ASCII files in). The syntax of find might preclude it from being within a for loop. I'd hope that somebody else with more bash experience could jump in there and sort it out so it does the right thing.
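In the same spirit, here is a more defensive sketch. My assumptions: bash, GNU file's --mime-encoding option, and that anything file doesn't report as us-ascii or utf-8 is really latin1 (as gvim suggested). Unlike the for loop above, it survives spaces in file names, and it writes converted copies alongside the originals rather than overwriting them:

```shell
# Recode every regular file whose detected encoding is neither plain
# ASCII nor UTF-8, writing the converted copy with a .utf8 suffix.
find . -type f -print0 | while IFS= read -r -d '' f; do
    enc=$(file -b --mime-encoding "$f")
    case "$enc" in
        us-ascii|utf-8) ;;                            # already fine, skip
        *) iconv -f LATIN1 -t UTF-8 "$f" > "$f.utf8" ;;
    esac
done
```

Test it in a throwaway directory first, as Oli says; encodings that are not actually latin1 will be mis-converted silently.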