I think bash is tripping over some anomalies in how accented characters are handled. You might want to grab some popcorn, because this is going to get technical for a little bit...
Unicode allows some accented characters to be represented in several different ways: as a single "code point" representing the accented character, or as a series of code points representing the unaccented version of the character, followed by the accent(s). For example, "ä" could be represented either precomposed as U+00E4 (UTF-8 0xc3a4, Latin small letter a with diaeresis) or decomposed as U+0061 U+0308 (UTF-8 0x61cc88, Latin small letter a + combining diaeresis).
OS X's HFS+ filesystem requires that all filenames be stored in the UTF-8 representation of their fully decomposed form. In an HFS+ filename, "ä" MUST be encoded as 0x61cc88, and "ö" MUST be encoded as 0x6fcc88.
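If you want to see the two encodings side by side, here's a minimal sketch (bash assumed; hex_bytes is just a helper name I made up, not a standard command). It dumps the bytes of each form of "ä" and checks whether the shell considers them the same string:

```shell
#!/usr/bin/env bash
# hex_bytes is a hypothetical helper: print the bytes of $1 as hex, no separators.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

precomposed=$'\xc3\xa4'    # U+00E4, Latin small letter a with diaeresis
decomposed=$'a\xcc\x88'    # U+0061 followed by U+0308, combining diaeresis

hex_bytes "$precomposed"; echo    # c3a4
hex_bytes "$decomposed"; echo     # 61cc88

# A plain string comparison does no Unicode normalization.
if [ "$precomposed" = "$decomposed" ]; then
  echo "same"
else
  echo "different"
fi
```

Both strings display as "ä", but a literal comparison sees two unrelated byte sequences.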
I'm pretty sure what's happening here is that when you type "Näyttökuva.png" at the command line, it's "typing" the characters in precomposed form. When the file is created, the filesystem decomposes the characters for storage. Everything is fine so far. But when you try to use tab-completion starting with "Nä", I think bash is failing to decompose the "ä" before searching for matches, and of course it doesn't find any.
To illustrate the difference, here's an example of what encoding is used when I just type "Näyttökuva.png" at the command line, vs. what's used when I store it as a filename and use tab completion to fill it in:
$ printf Näyttökuva.png | xxd # This time I pasted it in from this web page
0000000: 4ec3 a479 7474 c3b6 6b75 7661 2e70 6e67 N..ytt..kuva.png
$ touch Näyttökuva.png # Also pasted from the web
$ printf Näyttökuva.png | xxd # This time I tab-completed it after N
0000000: 4e61 cc88 7974 746f cc88 6b75 7661 2e70 Na..ytto..kuva.p
0000010: 6e67 ng
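The failed completion boils down to a prefix comparison with no normalization. Here's a rough sketch of that idea (bash assumed; prefix_result is a hypothetical helper, and this only imitates the comparison, it is not bash's actual completion code):

```shell
#!/usr/bin/env bash
# prefix_result is a hypothetical helper: report whether string $1 starts
# with prefix $2, compared literally (no Unicode normalization).
prefix_result() {
  case "$1" in
    "$2"*) echo "match" ;;
    *)     echo "no match" ;;
  esac
}

stored=$'Na\xcc\x88ytto\xcc\x88kuva.png'   # decomposed, as the filesystem returns it
typed_precomposed=$'N\xc3\xa4'             # "Nä" typed as U+00E4
typed_decomposed=$'Na\xcc\x88'             # "Nä" as a + U+0308

prefix_result "$stored" "$typed_precomposed"   # no match
prefix_result "$stored" "$typed_decomposed"    # match
prefix_result "$stored" "Na"                   # match
```

The precomposed prefix never matches the stored name, while the decomposed prefix (and a bare "Na") does.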
Now, as for the matter of characters getting lost when deleting and re-tab-completing, I suspect that's closely related. Specifically, I think bash is "deleting" one code point per press of the delete key, but erasing one character from the Terminal window per press. Because one of the deleted characters ("ö" this time) consisted of two code points, but only one character, the Terminal display gets out of sync. Try tab-completing the whole filename, deleting it back to "Näytt", then re-tab-completing: bash seems to think that only the combining diaeresis was deleted, not the entire "ö", so it re-adds the combining diaeresis, but this time it attaches to the "t":
$ echo Näytẗkuva.png
Näyttökuva.png
Note that when I press return, bash actually has the entire filename there; it's just the Terminal display that was confused.
TL;DR bash has some bugs handling decomposable accented characters.
EDIT: after some mulling, I think the only full solution is to fix bash (or wait for its developers to fix it). There might also be a way to input characters in decomposed form, but I have no idea what that would be. But I did find some partial workarounds:
Dragging and dropping a file from the Finder pastes its name in the correct form. Since the Finder gets the filename from the filesystem, it's already decomposed, so it just works.
You can actually tab-complete the accented character itself. For example, if you type "Na" and then tab, it'll match "Näyttökuva.png" because the canonical decomposition of "ä" starts with "a". But if you have a file named "Narwal.gif" in the same directory, that won't be very helpful...
I haven't tested this, but if you bind tab to menu-complete instead of complete, it should let you tab through possible matches so you can select the one you want even if you can't type the next letter. (Or you could bind it to a different keystroke, so you can use it only when you need to.)
For fixing the problem with the Terminal display getting out of sync, you could bind something to redraw-current-line -- it won't prevent the problem from happening, but it'll give you a way to resynchronize the display.
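For what it's worth, the two bindings above might look something like this in ~/.bashrc. This is an untested sketch: menu-complete and redraw-current-line are real readline commands, but the key sequences (Shift-Tab arriving as \e[Z, and Ctrl-X Ctrl-L) are just my assumptions, so substitute whatever your terminal actually sends and you don't already use:

```shell
# Only set key bindings in interactive shells, where line editing is enabled.
if [[ $- == *i* ]]; then
  # Assumed key: Shift-Tab cycles through the possible completions.
  bind '"\e[Z": menu-complete'
  # Assumed key: Ctrl-X Ctrl-L redraws the line to resync the display.
  bind '"\C-x\C-l": redraw-current-line'
fi
```

Keeping plain Tab bound to the normal complete means you only reach for the cycling behavior when the next letter can't be typed.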
Huh. Sure enough, it's not setting up VT100 line drawing by default, and apparently programs don't bother with little things like how you're supposed to send enacs before using smacs any more (no doubt because some Linux terminal emulator doesn't require it, therefore "nobody does").
Anyway, quick fix (here, at least) is to add this to your ~/.bashrc:
test -t 1 && tput enacs  # only when stdout is a terminal
Best Answer
I might not be able to fully solve your problem, but I can explain some of what's going on. The shell is behaving correctly; TextWrangler is not coping correctly with a slightly advanced requirement.
In test.txt, you have an a (garden-variety lowercase letter A) followed by a combining tilde (Unicode character U+0303). Combining characters generalize characters with accents. For all intents and purposes, ã (U+0061 LATIN SMALL LETTER A followed by U+0303 COMBINING TILDE) should be equivalent to ã (U+00E3 LATIN SMALL LETTER A WITH TILDE).
Quite possibly, if Unicode were invented now, only combining characters would exist, and we'd always use a followed by a combining mark; but Unicode also has many precomposed characters for compatibility with earlier existing encodings. Because these are the characters almost everybody uses, many programs do not support combining characters so well, if at all. In particular, it looks like TextWrangler does not support them at all and shows an “I don't know what this is” mark instead.
Generally speaking, OS X prefers decomposed characters (i.e. letter + combining accent). In particular, as far as I know, all file names are normalized to this form. Normalizing file names (i.e. making sure that if there are several possible forms of a file name, then a specific one will always be used) is very useful, because it avoids being unable to find leão.png when you're looking for leão.png. (You don't see a difference between the two? Good, your browser handles combining characters correctly.)
The ideal solution would be for you to use an editor that handles combining characters correctly. If you want to stick with TextWrangler, make sure you have the latest version, and if you do, contact the authors for support. With TextEdit, jEdit or AlphaX, there's hope yet: they're showing the file as Mac Roman instead of UTF-8; try to switch them to UTF-8.