Character Encoding – Fix gedit Not Recognizing Character Encoding While gvim Can

Tags: ascii, gedit, gvim, utf-8

I have a lot of plain text files which come from a Windows environment.
Many of them use a whacky default Windows code-page, which is neither ASCII (7 bits) nor UTF-8.

gvim has no problem opening these files, but gedit fails to do so.
gvim reports the encoding as latin1.

I assume that gvim is making a "smart" assumption about the code-page.
(I believe this code-page still has international variants).

Some questions arise from this:

(1). Is there some way gedit can be told to recognize this code-page?
** NB. [Update] For this point (1), see my answer, below.
** For points (2) and (3), see Oli's answer.
(2). Is there a way to scan the file system to identify these problem files?
(3). Is there a batch converting tool to convert these files to UTF-8?

(.. this old-world text mayhem was actually the final straw which brought me over to Ubuntu… UTF-8 system-wide by default. Brilliant!)

[UPDATE]
** NB: ** I now consider the following Update to be partially irrelevant, because the "problem" files aren't the "problem" (see my answer below).
I've left it here, because it may be of some general use to someone.

I've worked out a rough and ready way to identify the problem files…
The file command was not suitable, because it identified my example file as ASCII… but an ASCII file is 100% UTF-8 compliant…

As I mentioned in a comment below, the test for an invalid first byte of a UTF-8 codepoint is:

if the first byte (of a UTF-8 codepoint) is between 0x80 and 0xBF (reserved for additional bytes), or greater than 0xF7 ("overlong form"), that is considered an error

I know sed (a bit, via a Win32 port), so I've managed to cobble together a RegEx pattern which finds these offending bytes.

It's an ugly line, so look away now if regular expressions scare you 🙂

I'd really appreciate it if someone points out how to use hex values in a range [] expression.. I've just used the or operator \|

fqfn="/my/fully/qualified/filename"  
sed -n "/\x80\|\x81\|\x82\|\x83\|\x84\|\x85\|\x86\|\x87\|\x88\|\x89\|\x8A\|\x8B\|\x8C\|\x8D\|\x8E\|\x8F\|\x90\|\x91\|\x92\|\x93\|\x94\|\x95\|\x96\|\x97\|\x98\|\x99\|\x9A\|\x9B\|\x9C\|\x9D\|\x9E\|\x9F\|\xA0\|\xA1\|\xA2\|\xA3\|\xA4\|\xA5\|\xA6\|\xA7\|\xA8\|\xA9\|\xAA\|\xAB\|\xAC\|\xAD\|\xAE\|\xAF\|\xB0\|\xB1\|\xB2\|\xB3\|\xB4\|\xB5\|\xB6\|\xB7\|\xB8\|\xB9\|\xBA\|\xBB\|\xBC\|\xBD\|\xBE\|\xBF\|\xF8\|\xF9\|\xFA\|\xFB\|\xFC\|\xFD\|\xFE\|\xFF/p" "${fqfn}"
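On the question of hex values in a range: one possible shorter form, offered only as a sketch, is to hand grep a bracket expression made of raw bytes. It assumes a grep that compares single bytes under LC_ALL=C (GNU grep does); has_suspect_bytes is a made-up helper name, not something from the tools above.

```shell
# Sketch only: has_suspect_bytes is a hypothetical helper name.
# printf emits the raw bytes 0x80-0xBF and 0xF8-0xFF (written in octal), so
# the bracket expression holds two byte ranges. LC_ALL=C makes grep compare
# single bytes instead of multi-byte characters.
has_suspect_bytes() {
    LC_ALL=C grep -q "$(printf '[\200-\277\370-\377]')" "$1"
}

# Example: print the names of matching .txt files in the current directory.
# for f in *.txt; do has_suspect_bytes "$f" && printf '%s\n' "$f"; done
```

Like the sed line, this looks for those byte values anywhere in a line, so a file that really is multi-byte UTF-8 will also match on its continuation bytes. It only separates pure-ASCII files from everything else.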

So, I'll now graft this into Oli's batch solution… Thanks Oli!

PS. Here is the invalid UTF-8 byte it found in my sample file …
"H.Bork, Gøte-borg." … the "ø" = F8 hex… which is an invalid UTF-8 character.

Best Answer

iconv is probably what you'll want to use. iconv -l will show you the available encodings and then you can use a couple of commands to recode them all:

# all text files are in ./originals/
# new files will be written to ./newversions/

mkdir -p newversions
cd originals
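for file in *.txt; do
    # -f must name the real source encoding (see iconv -l). For the Windows
    # code-page files in the question that is probably WINDOWS-1252; with
    # -f ASCII iconv stops at the first byte above 0x7F.
    iconv -f WINDOWS-1252 -t UTF-8 "$file" > "../newversions/$file"
done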

If you want to do this with files you don't know the encoding of (because they're all over the place), you want to bring in a few more commands: find, file, awk and sed. The last two are just there to process the output of file.

for file in $(find . -type f -exec file --mime {} \; | grep "ascii" | awk '{print $1}' | sed 's/.$//'); do
    ...

I've no idea if this actually works so I certainly wouldn't run it from anything but the least important directory you have (make a testing folder with some known ASCII files in). The syntax of find might preclude it from being within a for loop. I'd hope that somebody else with more bash experience could jump in there and sort it out so it does the right thing.
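A different route that avoids parsing the output of file, sketched here under assumptions: let iconv decide which files need converting. A file that iconv can read as UTF-8 (plain ASCII included) is copied unchanged; anything else is assumed to be in one legacy encoding and converted. The function name convert_tree, the *.txt filter and the WINDOWS-1252 default are all illustrative choices, not something from the answer above.

```shell
# Sketch only: convert_tree is a hypothetical helper, and WINDOWS-1252 is an
# assumed default for the legacy encoding.
# Usage: convert_tree SRC_DIR DEST_DIR [FROM_ENCODING]
# Pass SRC_DIR without a trailing slash; file names containing newlines are
# not handled.
convert_tree() {
    src_dir=$1
    dest_dir=$2
    from=${3:-WINDOWS-1252}
    find "$src_dir" -type f -name '*.txt' | while IFS= read -r f; do
        rel=${f#"$src_dir"/}                      # path relative to SRC_DIR
        mkdir -p "$dest_dir/$(dirname "$rel")"
        if iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
            cp "$f" "$dest_dir/$rel"              # already valid UTF-8
        else
            iconv -f "$from" -t UTF-8 "$f" > "$dest_dir/$rel"
        fi
    done
}

# Example: convert_tree ./originals ./newversions
```

As with the loop above, try it on a scratch copy first. The originals are never modified, but a file in some other legacy encoding would be converted with the wrong table.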

Related Solutions

Text Editor – Stop Gedit, Gvim, Vim, Nano from Adding End-of-File Newline Character

For gEdit ~~there does not seem to be a way to disable having a newline inserted at the end~~ see this answer.

For Vim (and gVim) you can, by setting the following option: :set binary. In binary mode Vim saves the file as is and does not insert a newline at the end of the document (unless there already is one, in which case it is retained). At any time you can revert this option with :set nobinary.

Nano is a little easier. You can launch nano with the -L flag (or --nonewlines if you're more the visual type). This will disable the automatic addition of the newline to the end of the file (so long as one was not actually entered). Ideally you could set up this alias in your ~/.bashrc: alias nano="nano -L", which would launch Nano with the no-newlines flag included.

Since I don't know exactly what you're doing, I'm going to point you to Why should files end with a newline?, which may explain why they're there. In short, some applications will fail if they don't find an EOL directly before the EOF.
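To see whether an editor actually added (or left out) that final newline, a small check along these lines may help. It is only a sketch: ends_with_newline is a made-up helper name, and it relies on command substitution stripping a trailing newline.

```shell
# Sketch only: ends_with_newline is a hypothetical helper name.
# tail -c 1 prints the last byte of the file. Command substitution strips a
# trailing newline, so the result is empty exactly when that byte is "\n".
# Empty files are reported as not ending with a newline.
ends_with_newline() {
    [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}

# Example:
# ends_with_newline notes.txt && echo "has final newline" || echo "no final newline"
```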

Gvim – How to Use Quickly Instead of Gedit

I assume you mean you want to change the editor that Quickly loads when you ask it to.

Well I did some sleuthing... I'll show you what I did followed by the answer.

I fired off this command:
```
sudo find / -name "*quickly*" -exec grep gedit {} \;
```
That searches for all files with quickly in the name and then greps them for gedit. It was a long shot (I should have refined the search to cover any path containing quickly), but it matched!
```
Binary file /usr/share/quickly/templates/ubuntu-application/internal/quicklyutils.pyc matches
    editor = "gedit"
```

I opened up /usr/share/quickly/templates/ubuntu-application/internal/quicklyutils.py (not the compiled version) in nano, searched for gedit and saw:

def get_quickly_editors():
    '''Return prefered editor for ubuntu-application template'''

    editor = "gedit"
    default_editor = os.environ.get("EDITOR")
    if not default_editor:
        default_editor = os.environ.get("SELECTED_EDITOR")
    if default_editor:
       editor = default_editor
    return editor

From that you can see it asks for the environment value EDITOR!

Just run your quickly commands as:
```
env EDITOR=gvim quickly edit
```
or export it to persist
```
export EDITOR=gvim
quickly edit
```
Add the export line to your ~/.bashrc file if you want it to persist between sessions.
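If you want to check from the shell which editor that lookup would end up with, the same fallback order can be mimicked with parameter expansion. This is just a sketch of the logic in get_quickly_editors as quoted above; quickly_editor is a made-up name and is not part of Quickly.

```shell
# Sketch only: quickly_editor is a hypothetical helper that mirrors the quoted
# Python logic: use EDITOR if set and non-empty, else SELECTED_EDITOR, else
# fall back to gedit.
quickly_editor() {
    printf '%s\n' "${EDITOR:-${SELECTED_EDITOR:-gedit}}"
}

# Example: quickly_editor
```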
