For official Ubuntu documentation, where the source English files are in DocBook XML, only ASCII characters are allowed. We use a "checker" command line (see here).
grep --color='auto' -P -n "[\x80-\xFF]" *.xml
However, the command has a flaw, which apparently does not show up on all computers: it misses some lines that contain non-ASCII characters, potentially resulting in a false O.K. result.
Does anyone have a better suggestion for an ASCII checker command line?
Interested persons might consider using this file (a text file, not a DocBook XML file) as a test case. The first three lines with non-ASCII characters are lines 9, 14 and 18. Lines 14 and 18 were missed in the check:
$ grep --color='auto' -P -n "[\x80-\xFF]" install.en.txt | head -13
9:Appendix F, GNU General Public License.
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
394:1.1.1. Sponsorship by Canonical
402:1.2. What is Debian?
456:1.2.1. Ubuntu and Debian
461:1.2.1.1. Package selection
475:1.2.1.2. Releases
501:1.2.1.3. Development community
520:1.2.1.4. Freedom and Philosophy
534:1.2.1.5. Ubuntu and other Debian derivatives
555:1.3. What is GNU/Linux?
Best Answer
If you want to look for non-ASCII characters, perhaps you should invert the search to exclude ASCII characters:
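A minimal sketch of such an inverted search, assuming GNU grep built with PCRE support (`-P`). The sample file and its contents are made up for illustration and are not the file from the question:

```shell
# Hypothetical test file:
#   line 2 contains a non-breaking space (U+00A0, bytes C2 A0),
#   line 3 contains a right single quotation mark (U+2019, bytes E2 80 99).
sample=$(mktemp)
printf 'plain ascii\nnon\302\240breaking\nit\342\200\231s\n' > "$sample"

# Match any character that is NOT in the ASCII range 0x00-0x7F,
# instead of trying to list the high bytes explicitly.
result=$(grep -P -n '[^\x00-\x7F]' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

With this sample, both line 2 and line 3 should be reported. The negated class does not depend on how the locale interprets `\x80-\xFF`, which is why it is less likely to skip characters above U+00FF.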
For example, in lines 9, 330, 337 and 359 of the test file, Unicode non-breaking space characters are present.
The particular output you get may be due to grep's support for UTF-8. For a Unicode locale, some of those characters may compare equal to a normal ASCII character. Forcing the C locale will show the expected results in that case:
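A sketch of forcing the C locale for a single grep invocation, again assuming GNU grep with `-P` support and using a made-up sample file rather than the file from the question:

```shell
# Hypothetical test file with a non-breaking space on line 2 (bytes C2 A0)
# and a right single quotation mark on line 3 (bytes E2 80 99).
sample=$(mktemp)
printf 'plain ascii\nnon\302\240breaking\nit\342\200\231s\n' > "$sample"

# LC_ALL=C applies only to this one command. grep then treats the input
# as raw bytes, so any byte in the range 0x80-0xFF matches, whatever
# character it is part of.
c_result=$(LC_ALL=C grep -P -n '[\x80-\xFF]' "$sample")
printf '%s\n' "$c_result"

rm -f "$sample"
```

Applied to the checker from the question, the same prefix would give `LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" *.xml`. As far as I understand, in a UTF-8 locale `grep -P` reads `\x80-\xFF` as the code points U+0080 to U+00FF, which would explain why a non-breaking space (U+00A0) is found while characters with higher code points are missed.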