Ubuntu – ASCII source file checker

command line, documentation, text processing

The official Ubuntu documentation, whose English source files are DocBook XML, must contain only ASCII characters. We check this with the following "checker" command line:

grep --color='auto' -P -n "[\x80-\xFF]" *.xml

However, the command has a flaw: on some computers, though apparently not all, it misses lines that contain non-ASCII characters, which can produce a false "OK" result.

Does anyone have a better suggestion for an ASCII checker command line?

Anyone interested might use install.en.txt (a text file, not a DocBook XML file) as a test case. The first three lines with non-ASCII characters are lines 9, 14 and 18. Lines 14 and 18 were missed in the check:

$ grep --color='auto' -P -n "[\x80-\xFF]" install.en.txt | head -13
9:Appendix F, GNU General Public License.
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
394:1.1.1. Sponsorship by Canonical
402:1.2. What is Debian?
456:1.2.1. Ubuntu and Debian
461:1.2.1.1. Package selection
475:1.2.1.2. Releases
501:1.2.1.3. Development community
520:1.2.1.4. Freedom and Philosophy
534:1.2.1.5. Ubuntu and other Debian derivatives
555:1.3. What is GNU/Linux?

Best Answer

If you want to look for non-ASCII characters, perhaps you should invert the search to exclude ASCII characters:

grep -Pn '[^\x00-\x7F]'

For example:

$ curl https://help.ubuntu.com/16.04/installation-guide/amd64/install.en.txt -s | grep -nP '[^\x00-\x7F]' | head
9:Appendix F, GNU General Public License.
14:(codename "‘Xenial Xerus’"), for the 64-bit PC ("amd64") architecture. It also
18:━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
368:  • Ubuntu will always be free of charge, and there is no extra fee for the "
372:  • Ubuntu includes the very best in translations and accessibility
376:  • Ubuntu is shipped in stable and regular release cycles; a new release will
380:  • Ubuntu is entirely committed to the principles of open source software

In lines 9, 330, 337 and 359, Unicode non-breaking space characters are present.
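
One way to confirm which characters are involved is to dump the non-ASCII bytes of a single line in hexadecimal. The following is a minimal sketch, not part of the original answer; the function name non_ascii_hex is made up for illustration. It relies on the fact that U+00A0 (no-break space) is encoded in UTF-8 as the two bytes c2 a0.

```shell
# non_ascii_hex FILE LINE  (hypothetical helper, not from the original answer)
# Prints, in hex, only the non-ASCII bytes found on the given line of FILE.
non_ascii_hex() {
    sed -n "${2}p" "$1" |            # select the requested line
        LC_ALL=C tr -d '\000-\177' | # delete every ASCII byte, newline included
        od -An -tx1                  # dump what is left as hex bytes
}
```

Running non_ascii_hex install.en.txt 9 should then show a c2 a0 pair for each non-breaking space on that line, if the reading above is right.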

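The particular output you get may be due to grep's support for UTF-8. In a UTF-8 locale, grep -P most likely interprets \x80-\xFF as the code points U+0080 to U+00FF rather than as raw bytes, which would explain why the non-breaking space (U+00A0) is matched while characters with higher code points, such as ’ (U+2019), are not. Forcing the C locale will show the expected results in that case: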

$ LANG=C grep -Pn '[\x80-\xFF]' install.en.txt| head
9:Appendix F, GNU General Public License.
14:(codename "‘Xenial Xerus’"), for the 64-bit PC ("amd64") architecture. It also
18:━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
368:  • Ubuntu will always be free of charge, and there is no extra fee for the "
372:  • Ubuntu includes the very best in translations and accessibility
376:  • Ubuntu is shipped in stable and regular release cycles; a new release will
380:  • Ubuntu is entirely committed to the principles of open source software

$ LANG=en_GB.UTF-8 grep -Pn '[\x80-\xFF]' install.en.txt| head
9:Appendix F, GNU General Public License.
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
394:1.1.1. Sponsorship by Canonical
402:1.2. What is Debian?
456:1.2.1. Ubuntu and Debian
461:1.2.1.1. Package selection
475:1.2.1.2. Releases
501:1.2.1.3. Development community
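
To keep the check independent of whoever runs it, the locale can be pinned inside a small wrapper. This is a sketch under assumptions: check_ascii is a hypothetical name, it sets LC_ALL rather than LANG because LC_ALL takes precedence over the other locale variables, and it assumes a GNU grep with -P support, as in the commands above.

```shell
# check_ascii FILE...  (hypothetical helper, not from the original answer)
# Prints each line that contains a non-ASCII byte, prefixed by its line number.
# Returns 0 if every file is pure ASCII, 1 if any non-ASCII byte was found,
# and 2 if grep itself failed (for example, on a missing file).
check_ascii() {
    _rc=0
    # LC_ALL=C makes grep work on single bytes, whatever the user's locale is.
    LC_ALL=C grep -n -P '[^\x00-\x7F]' "$@" || _rc=$?
    case $_rc in
        0) return 1 ;;  # grep matched: non-ASCII bytes are present
        1) return 0 ;;  # grep found nothing: the input is ASCII only
        *) return 2 ;;  # grep reported an error
    esac
}
```

A possible use in a documentation build would be check_ascii *.xml && echo "ASCII only".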

Related Solutions

Ubuntu – Add lines from file to another file

You can read both files together, line by line, to get your desired output. (I assume you don't have any other unwanted lines in these files.)

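# Read one line from each file per iteration; stop when either file ends.
while read -r line1 && read -r line2 <&3
do
    # Quote the variables so whitespace and glob characters are preserved.
    echo "$line1"
    echo "$line2"
done < users.txt 3< mails.txt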

users.txt is read from standard input (file descriptor 0).

mails.txt is read from file descriptor 3, which we open ourselves.

Output:

johnny
johnny@email.com


james
james@email.com


clara
clara@email.com


brandon
brandon@email.com


steve
steve@email.com


louis
louis@email.com


daniel
daniele@email.com
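
For this particular task, the standard paste utility can produce a similar interleaving without an explicit loop. This is an alternative technique, not the loop shown above, and interleave is a hypothetical function name. One difference worth noting: the loop stops at the end of the shorter file, whereas paste keeps going until the longer file is exhausted and emits empty lines for the missing entries.

```shell
# interleave FILE1 FILE2  (hypothetical helper)
# Prints line 1 of FILE1, line 1 of FILE2, line 2 of FILE1, and so on.
interleave() {
    # With a newline as the delimiter, paste writes one line from the
    # first file, then one line from the second, and repeats.
    paste -d '\n' "$1" "$2"
}
```

With the files from the answer, the call would be interleave users.txt mails.txt.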

Ubuntu – Replace second instance of string in a line in an ASCII file using Bash

Well, if it is the end of the line...

$ sed 's/\.png$/.mat/' file
file1.png otherfile1.mat
file2.png otherfile2.mat
file3.png otherfile3.mat

s/old/new/ search and replace
\. literal dot (without the escape it matches any character)
$ end of line

Or, to target the second column explicitly, you could use awk:

$ awk 'gsub(/\.png/, ".mat", $2)' file
file1.png otherfile1.mat
file2.png otherfile2.mat
file3.png otherfile3.mat

gsub(old, new, where) search and replace
$2 second column
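
If the second occurrence is not guaranteed to sit at the end of the line or in the second column, the numeric flag of sed's s command can target it directly. A minimal sketch, with replace_second as a hypothetical wrapper name:

```shell
# replace_second [FILE...]  (hypothetical helper)
# Replaces the second ".png" on each line with ".mat"; reads stdin if no file is given.
replace_second() {
    # The trailing 2 tells sed to replace only the second match on each line.
    sed 's/\.png/.mat/2' "$@"
}
```

Lines that contain fewer than two matches pass through unchanged.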
