Character Encoding – How to Re-encode a Mixed Encoded Text File

character-encoding conversion

I've got a log file that is ASCII, except for a few UTF-8 characters (which I can fix for a future version).

For the moment, I need to figure out how to get this file to a viewable/searchable/editable state by gedit/less etc.

enca -L none file returns:

7bit ASCII characters
  Surrounded by/intermixed with non-text data

enconv -L none -X ASCII file and enconv -L none -X UTF-8 file "succeed" but do not actually change anything.

How do I go about fixing this file?

Update (after some answers):

Actually, as stated below (upvotes to all :)), ASCII + UTF-8 is UTF-8. What I have is

0003bbc0  28 4c 6f 61 64 65 72 29  20 50 61 74 69 65 6e 74  |(Loader) Patient|
0003bbd0  20 00 5a 00 5a 00 5a 00  38 00 31 00 30 00 34 00  | .Z.Z.Z.8.1.0.4.|
0003bbe0  20 6e 6f 74 20 66 6f 75  6e 64 20 69 6e 20 64 61  | not found in da|
0003bbf0  74 61 62 61 73 65 0d 0a  32 36 20 53 65 70 20 32  |tabase..26 Sep 2|

I believe it will be a cp1252-type encoding. Actually, I don't know what it is; cp1252 would use one byte per ASCII character, wouldn't it?

Incidentally, the fact that Linux barfs on this helped me figure out that an input file (where the IDs came from) was badly encoded…

Best Answer

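What you have is in fact ASCII (in its usual encoding in 8-bit bytes) with a bit of UCS-2 (Unicode restricted to the Basic Multilingual Plane (BMP), where each character is encoded as two 8-bit bytes), or perhaps UTF-16 (an extension of UCS-2 that can encode all of Unicode by using a pair of 16-bit words, a surrogate pair, for code points above U+FFFF). For ASCII characters the two are identical: the ASCII byte plus a null byte, which is the 5a 00 5a 00 … pattern in your dump (little-endian byte order).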
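To see where the null bytes come from, here is a small sketch that encodes the ID from the dump as UTF-16LE and prints the resulting bytes. It assumes iconv and od are installed; utf16le_hex is just a helper name made up for this example.

```shell
# Hypothetical helper: read ASCII on stdin, convert it to UTF-16LE,
# and print the resulting bytes in hex.
utf16le_hex() {
    iconv -f ASCII -t UTF-16LE | od -An -tx1
}

# Each ASCII character becomes its own byte followed by a 00 byte,
# e.g. Z (0x5a) turns into 5a 00.
printf 'ZZZ8104' | utf16le_hex
```

The bytes printed should be the same 5a 00 5a 00 5a 00 38 00 31 00 30 00 34 00 run that appears in the second row of the hexdump in the question.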

I doubt you'll find a tool that can handle such an unholy mixture out of the box. There is no way to decode the file in full generality, because nothing in it marks where one encoding stops and the other starts. In your case, what probably happened is that some ASCII data was encoded into UTF-16 at some point (Windows and Java are fond of UTF-16; it's practically unheard of in the unix world). If you go by the assumption that the original data was all ASCII, you can recover a usable file by stripping off all null bytes.

<bizarre tr -d '\000' >ascii
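Before running this on the real log, it may be worth trying it on a few known bytes. The sketch below rebuilds part of the hexdump from the question, counts its null bytes, and strips them. The function names make_sample, count_nuls and strip_nuls are made up for this example, and it carries the same assumption as above: every two-byte character in the file is really plain ASCII.

```shell
# Rebuild part of the hexdump: ASCII text with a UTF-16LE run in the middle.
make_sample() {
    printf '(Loader) Patient'
    # printf reuses the format for every argument, so each character
    # is written followed by one NUL byte (i.e. as UTF-16LE).
    printf '%s\000' ' ' Z Z Z 8 1 0 4
    printf ' not found in database\r\n'
}

# Count NUL bytes on stdin: delete everything except NUL, count the rest.
count_nuls() {
    tr -cd '\000' | wc -c | tr -d ' '
}

# The fix from above: drop every NUL byte.
strip_nuls() {
    tr -d '\000'
}

make_sample | count_nuls               # number of NUL bytes before the fix
make_sample | strip_nuls               # the recovered line
make_sample | strip_nuls | count_nuls  # should be 0 after the fix
```

On the real file, running the same count on bizarre and on ascii gives a quick check of how many bytes were removed. The CR LF line endings visible in the dump (0d 0a) are left alone by this; tr -d '\r' would remove the carriage returns if they get in the way.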