I've got a log file that is ASCII, except for a few UTF-8 characters (which I can fix for a future version).
For the moment, I need to figure out how to get this file to a viewable/searchable/editable state by gedit/less etc.
enca -L none file

reports:

7bit ASCII characters
Surrounded by/intermixed with non-text data
Both

enconv -L none -X ASCII file

and

enconv -L none -X UTF-8 file

"succeed" but do not actually change anything.
How do I go about fixing this file?
Update (after some answers):
Actually, as stated below (upvotes to all :)), ASCII + UTF-8 is UTF-8. What I have is
0003bbc0 28 4c 6f 61 64 65 72 29 20 50 61 74 69 65 6e 74 |(Loader) Patient|
0003bbd0 20 00 5a 00 5a 00 5a 00 38 00 31 00 30 00 34 00 | .Z.Z.Z.8.1.0.4.|
0003bbe0 20 6e 6f 74 20 66 6f 75 6e 64 20 69 6e 20 64 61 | not found in da|
0003bbf0 74 61 62 61 73 65 0d 0a 32 36 20 53 65 70 20 32 |tabase..26 Sep 2|
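The dump above looks like the output of hexdump -C. A near-equivalent view can be produced with POSIX od; here is a small sketch that builds a sample file mixing plain ASCII with NUL-padded bytes like those shown (sample.log is a placeholder name, not the actual log file):

```shell
# Create a tiny sample: plain ASCII followed by NUL-padded (UCS-2-style) text,
# mimicking the " .Z.Z.Z" run in the dump above. 'sample.log' is a placeholder.
printf 'Patient \0Z\0Z\0Z' > sample.log

# Hex offsets, hex bytes, and a printable-character column (the 'z' modifier):
od -A x -t x1z sample.log
```

The printable column renders each NUL as a dot, which is exactly the `.Z.Z.Z` pattern visible in the dump.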
I first thought it would be a cp1252-type encoding, but actually I'm not sure: cp1252 would still use a single byte for ASCII characters, wouldn't it?
Incidentally, the fact that Linux barfs on this helped me figure out that an input file (where the IDs came from) was badly encoded…
Best Answer
What you have is in fact ASCII (in its usual encoding in 8-bit bytes) with a bit of UCS-2 (Unicode restricted to the basic plane (BMP), where each character is encoded as two 8-bit bytes), or perhaps UTF-16 (an extension of UCS-2 that can encode all of Unicode by using a multi-word encoding for code points above U+D7FF).
I doubt you'll find a tool that can handle such an unholy mixture out of the box. There is no way to decode the file in full generality. In your case, what probably happened is that some ASCII data was encoded into UTF-16 at some point (Windows and Java are fond of UTF-16; it's practically unheard of in the unix world). If you go by the assumption that the original data was all ASCII, you can recover a usable file by stripping all null bytes.
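Under that assumption, the cleanup is a one-liner (file names here are placeholders; this only works if every two-byte character in the file really is an ASCII character padded with a NUL byte):

```shell
# Delete every NUL byte, collapsing the UCS-2/UTF-16 runs back to ASCII.
# 'file' and 'file.clean' are placeholder names for the log and its output.
tr -d '\0' < file > file.clean
```

After this, the result should be plain ASCII that gedit, less, and grep handle normally.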