Strange case: text file that exists and doesn’t exist

Tags: filenames, files, ls

I'm completely puzzled by a problem with a single plain text file on my Fedora 12 system. I used a well-known bioinformatics program, maker, to produce lots of plain text files, and one of them seems to be "inaccessible".

Specifically, my file named Clon1918K_PCC1.gff is listed when I use ls, ls -a, ls -li, etc., but when I try to access it with cat, vim, cp, ls, etc., I always get the same error: Clon1918K_PCC1.gff: No such file or directory.

However, when I copy all the files with cp *.gff or cp *, this file is also copied.

I was also able to open it with nautilus without any problem, and in one of two cases, when I copied the content to another file with the same name, the problem disappeared. Interestingly, in that case the strange file is not overwritten, and two files with exactly the same name appear, one of them accessible and the other inaccessible. I looked for hidden characters, but everything seems OK.

Does anyone have any idea about this strange case?
Thanks!

Best Answer

You can't have two files with the same name in the same directory. Filenames are by definition unique keys.

What you have is almost certainly a special character. I know you checked for them, but how exactly? You could run something like ls *gff | hexdump -C to find where the special characters are. Any byte with the high bit set (that is, hexadecimal values between 80 and FF) is an indication of something gone wrong. Anything below hex 20 (decimal 32) is also a special character. Another hint is the presence of dots (.) in the right-hand text column of the hexdump -C output.
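As a minimal sketch (assuming you run it in the directory that holds the file, and that your grep is POSIX-compatible), you can print only the filenames that contain a byte outside the printable ASCII range:

# Print the .gff filenames that contain any byte outside printable
# ASCII (space through tilde). Forcing the C locale makes the bracket
# expression a plain byte range, so the bytes of a multi-byte UTF-8
# character are caught as well.
ls *gff | LC_ALL=C grep '[^ -~]'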

There are numerous Unicode characters that look like US ASCII characters. Even in US ASCII, 1 and l can often look similar. Then you have the C from Cyrillic (U+0421), the Greek lunate sigma (U+03F9, also exactly like a C), the Cyrillic/Greek lower case ‘o’, etc. And those are just the visible ones. There are quite a few invisible Unicode characters that could be in there.
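As a quick illustration (assuming a UTF-8 terminal, and using only printf and hexdump), two such lookalikes produce very different bytes:

# ASCII 'C' is the single byte 0x43:
printf 'C' | hexdump -C
# Cyrillic 'С' (U+0421) is the two-byte UTF-8 sequence 0xd0 0xa1;
# both bytes have the high bit set:
printf 'С' | hexdump -C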


Explanation: why does the high bit signify something gone wrong? The filename ‘Clon1918K_PCC1.gff’ appears to be 100% 7-bit US ASCII. Putting it through hexdump -C produces this:

00000000  43 6c 6f 6e 31 39 31 38  4b 5f 50 43 43 31 2e 67  |Clon1918K_PCC1.g|
00000010  66 66                                             |ff|

All of these byte values are below 0x80 (8th bit clear) because they are all 7-bit US ASCII codepoints. Unicode codepoints U+0000 to U+007F represent the traditional 7-bit US ASCII characters. Codepoints U+0080 and above represent other characters and are encoded as two to six bytes in UTF-8 (on Linux, try man utf8 for a lot of information on how this is done). By definition, UTF-8 encodes US ASCII codepoints as themselves (i.e. hex ASCII character 41, Unicode U+0041, is encoded as the single byte 41). Codepoints ≥ 128 are encoded as two to six bytes, each of which has its eighth bit set. The presence of a non-ASCII character can therefore easily be detected without having to decode the stream. For example, say I replace the third character in the filename, ‘o’ (ASCII 6f, U+006F), with the Unicode character ‘U+03BF GREEK SMALL LETTER OMICRON’, which looks like this: ‘ο’. hexdump -C then produces this:

00000000  43 6c ce bf 6e 31 39 31  38 4b 5f 50 43 43 31 2e  |Cl..n1918K_PCC1.|
00000010  67 66 66                                          |gff|

The third character is now encoded as the UTF-8 sequence ce bf, each byte of which has its 8th bit set. And that is your sign of trouble in this case. Also, note how hexdump, which only decodes 7-bit ASCII, fails to decode the single UTF-8 character and prints a dot (.) for each of its two bytes instead.
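Once the offending character has been located, one way to regain access (a sketch, not the only option) is to rename the file with a glob that sidesteps typing the character, or by inode number as shown in the first column of ls -li:

# Rename using a glob; '*' stands in for the suspect character(s).
# This assumes only the problem file matches the pattern.
mv Cl*n1918K_PCC1.gff Clon1918K_PCC1.gff

# Or rename by inode number; 123456789 is a placeholder, replace it
# with the inode shown by ls -li for the problem file.
find . -maxdepth 1 -inum 123456789 -exec mv {} Clon1918K_PCC1.gff \;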