I'm completely puzzled by a problem with a single plain-text file on my Fedora 12 system. I used MAKER, a well-known bioinformatics program, to produce lots of plain-text files, and one of them seems to be "inaccessible".
Specifically, my file named Clon1918K_PCC1.gff is listed when I use ls, ls -a, ls -li and similar commands, but when I try to access it with cat, vim, cp, ls, etc., I always get the same error: Clon1918K_PCC1.gff: No such file or directory.
However, when I copy all the files with cp *.gff or cp *, this file is copied along with the rest.
I also tried opening it with Nautilus without any problem, and in one of two attempts, after I copied the content to another file with the same name, the problem disappeared. Interestingly, in that case the strange file was not overwritten: two files with exactly the same name appeared, one of them accessible and the other inaccessible. I looked for hidden characters, but everything seems OK.
Does anyone have any idea about this strange case?
Thanks!
Best Answer
You can't have two files with the same name in the same directory. Filenames are by definition unique keys.
What you have is almost certainly a special character. I know you checked for them, but how exactly? You could run something like

    ls *gff | hexdump -C

to find where the special characters are. Any byte with the high bit set (that is, hexadecimal values between 80 and FF) is an indication of something gone wrong. Anything below 20 (decimal 32) is also a special character. Another hint is the presence of dots (.) in the right-hand text column of the hexdump -C output.
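If you'd rather not eyeball a hex dump, a quick byte-level filter can flag suspicious names directly. A minimal sketch (assuming GNU grep; LC_ALL=C makes it treat the input as raw bytes):

```shell
# Print only filenames that contain bytes outside printable ASCII.
# [^ -~] matches any byte outside the range 0x20 (space) to 0x7e (~),
# which covers both high-bit bytes and control characters.
ls | LC_ALL=C grep '[^ -~]'

# Alternatively, ls -b prints non-printable bytes as backslash escapes,
# so an odd name stands out, e.g. Cl\316\277n1918K_PCC1.gff
```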
There are numerous characters that look like US-ASCII characters in UTF-8. Even within US-ASCII, 1 and l can often look similar. Then you have the Cyrillic capital Es (U+0421, which looks exactly like a C), the Greek capital lunate sigma (U+03F9, also exactly like a C), the Cyrillic/Greek lower-case ‘o’, etc. And those are just the visible ones; there are also quite a few invisible Unicode characters that could be in there.
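You can see such lookalikes differ at the byte level even though they render identically. A small demonstration (assuming a UTF-8 terminal; the second printf argument really is the Cyrillic letter):

```shell
# ASCII capital C is a single byte:
printf 'C' | od -An -tx1     # 43
# Cyrillic capital Es (U+0421) looks the same but is two bytes in UTF-8:
printf 'С' | od -An -tx1     # d0 a1
```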
Explanation: why does the high bit signify something gone wrong? The filename ‘Clon1918K_PCC1.gff’ appears to be 100% 7-bit US-ASCII. Putting it through hexdump -C produces this:

    00000000  43 6c 6f 6e 31 39 31 38  4b 5f 50 43 43 31 2e 67  |Clon1918K_PCC1.g|
    00000010  66 66 0a                                          |ff.|
    00000013

All of these byte values are below 0x80 (8th bit clear) because they are all 7-bit US-ASCII codepoints. Unicode codepoints U+0000 to U+007F represent the traditional 7-bit US-ASCII characters. Codepoints U+0080 and above represent other characters and are encoded as two to six bytes in UTF-8 (on Linux, try man utf8 for a lot of information on how this is done). By definition, UTF-8 encodes US-ASCII codepoints as themselves (i.e. hex ASCII character 41, Unicode U+0041, is encoded as the single byte 41). Codepoints ≥ 128 are encoded as two to six bytes, each of which has its eighth bit set. The presence of a non-ASCII character can therefore easily be detected without having to decode the stream.

For example, say I replace the third character in the filename, ‘o’ (ASCII 6f, U+006F), with the Unicode character ‘U+03BF GREEK SMALL LETTER OMICRON’, which looks like this: ‘ο’. hexdump -C then produces this:

    00000000  43 6c ce bf 6e 31 39 31  38 4b 5f 50 43 43 31 2e  |Cl..n1918K_PCC1.|
    00000010  67 66 66 0a                                       |gff.|
    00000014

The third character is now encoded as the UTF-8 sequence ce bf, each byte of which has its 8th bit set. And this is your sign of trouble in this case. Also, note how hexdump, which only decodes 7-bit ASCII, fails to decode the single UTF-8 character and shows two unprintable characters (..) instead.
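Once you've confirmed the name contains a stray byte, one robust way to rename the file without having to type the odd character is to address it by inode (a sketch; 1234567 is a placeholder for whatever inode number ls -li actually shows for the odd file, and the new name is chosen to avoid clashing with your good copy):

```shell
# Note the inode number of the misbehaving file (first column):
ls -li
# Rename it via its inode; replace 1234567 with the number from above:
find . -maxdepth 1 -inum 1234567 -exec mv '{}' Clon1918K_PCC1.fixed.gff \;
```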