I might not be able to fully solve your problem, but I can explain some of what's going on. The shell is behaving correctly; TextWrangler is not coping correctly with a slightly advanced requirement.
In test.txt, you have an a (garden-variety lowercase letter A) followed by a combining tilde (Unicode character U+0303). Combining characters generalize characters with accents. For all intents and purposes, ã (U+0061 LATIN SMALL LETTER A followed by U+0303 COMBINING TILDE) should be equivalent to ã (U+00E3 LATIN SMALL LETTER A WITH TILDE).
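To see that the two forms really are different byte sequences even though they render alike, here is a small sketch that dumps their UTF-8 encodings. The variable names are mine, and the octal escapes are just the UTF-8 bytes of the code points named above.

```shell
# Decomposed form: "a" (0x61) followed by U+0303, which is CC 83 in UTF-8.
decomposed=$(printf 'a\314\203' | od -An -tx1 | tr -d ' \n')

# Precomposed form: U+00E3, which is C3 A3 in UTF-8.
precomposed=$(printf '\303\243' | od -An -tx1 | tr -d ' \n')

echo "decomposed:  $decomposed"
echo "precomposed: $precomposed"
```

A program that compares or displays text byte by byte, without normalizing first, will treat these as two unrelated strings.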
Quite possibly, if Unicode were invented now, only combining characters would exist, and we'd always use the decomposed form (a followed by U+0303); but Unicode also has many precomposed characters for compatibility with earlier encodings. Because these are the characters almost everybody uses, many programs do not support combining characters well, if at all. In particular, it looks like TextWrangler does not support them at all and shows an “I don't know what this is” mark instead.
Generally speaking, OS X prefers decomposed characters (i.e. letter + combining accent). In particular, as far as I know, all file names are normalized to this form. Normalizing file names (i.e. making sure that if there are several possible forms of a file name, a specific one is always used) is very useful, because it avoids being unable to find leão.png when you're looking for leão.png. (You don't see a difference between the two? Good, your browser handles combining characters correctly.)
The ideal solution would be for you to use an editor that handles combining characters correctly. If you want to stick with TextWrangler, make sure you have the latest version, and if you do, contact the authors for support. With TextEdit, jEdit or AlphaX, there's hope yet: they're showing the file as Mac Roman instead of UTF-8; try to switch them to UTF-8.
A somewhat dangerous solution is this, from the command line:
find . -type f ! -regex '.*/[ -.0-~]*' -exec rm {} +
Replace the lone . with the name of the top directory if you haven't changed to the relevant directory first. To be safe, however, first try the shorter command
find . -type f ! -regex '.*/[ -.0-~]*'
and ensure that it only lists files you wish to delete. The regular expression (regexp, or regex) here matches any pathname that ends in a slash followed by any combination of printable ASCII characters excluding /, the space character being the first such and ~ the last, while . and 0 surround / in the ASCII sequence. Because of the ! in front of -regex, find selects the files whose names do not fit that pattern, i.e. those containing anything other than printable ASCII.
One caveat among many: I don't know for sure if your current locale might change the collating sequence of characters, and hence perhaps change the meaning of the regexp. I don't think it does, but if it does, running the commands as
LC_COLLATE=C find …
should remove the danger.
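If you would rather see the pattern in action before pointing it at real files, here is a minimal dry-run sketch in a throwaway directory. The file names and variable names are made up for illustration, and I use LC_ALL=C (a stronger setting than LC_COLLATE=C, and my own choice here) so that the bracket ranges are compared byte by byte.

```shell
# Work in a throwaway directory so nothing real is at risk.
scratch=$(mktemp -d)

# One plain ASCII name and one name containing a precomposed a-tilde (bytes C3 A3).
accented=$(printf 'le\303\243o.png')
touch "$scratch/plain.txt" "$scratch/$accented"

# Dry run: list what the pattern selects, with no -exec rm attached.
matches=$(LC_ALL=C find "$scratch" -type f ! -regex '.*/[ -.0-~]*')
printf '%s\n' "$matches"

# Clean up only the scratch directory created above.
rm -r "$scratch"
```

Only the accented name should be listed; plain.txt consists entirely of printable ASCII and is therefore left alone.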
Yet another caveat: Please ensure you have a backup before you try this. I will not take the blame for any loss of data if you get it wrong. The command line is a great tool for shooting yourself in the foot! Sometimes just a misplaced space can spell disaster. (In this case, for example, missing the single space after the left bracket is deadly.)
As of today, macOS Mojave ships with a quite outdated version of groff (1.19 or something...) which apparently cannot handle the -K option. Thus it fails to recognise any fancy diacritics (German umlauts in my case) if you run groff -Kutf8 ....
You can get a newer version of groff on macOS by installing it via Homebrew, as per this post (not sure if gs for ghostscript is actually required; I installed it anyway).
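A quick way to check which situation you are in is to feed groff a short UTF-8 string with -Kutf8 and look at the exit status. This is only a sketch: check_groff_k is a name I made up, and it assumes that a groff without -K support exits non-zero when given the option.

```shell
# Report whether the groff found on PATH accepts -Kutf8.
# Assumption: a groff that lacks -K exits non-zero when given it.
check_groff_k() {
    if ! command -v groff >/dev/null 2>&1; then
        echo "missing"
    elif printf 'Gr\303\274\303\237e\n' | groff -Kutf8 -Tutf8 >/dev/null 2>&1; then
        echo "ok"
    else
        echo "no-K"
    fi
}

result=$(check_groff_k)
echo "groff -K support: $result"
```

If this reports no-K, running brew install groff and opening a new shell so the Homebrew copy comes first on PATH should be enough to change the answer.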