Ubuntu – How to search for lines in a file that only contain ASCII characters and then act on them

command line, language, libreoffice, regex, text processing

I have a text file that looks like this:

English words only
English and 日本語
日本語のみ
English words only
English and 日本語
日本語のみ
English words only
Also English words only
English and 日本語
日本語のみ
English words only
English and 日本語
日本語のみ

Note that in the middle there are two consecutive lines, English words only and Also English words only, one right after the other.

What I need to do is take those two lines, and combine into one line separated by a /, like this:

English words only
English and 日本語
日本語のみ
English words only
English and 日本語
日本語のみ
English words only / Also English words only
English and 日本語
日本語のみ
English words only
English and 日本語
日本語のみ

I've found that I can search for lines with ASCII characters with the following regular expression, [[:ascii:]], and for non-ASCII with [^[:ascii:]]. However, I'm having a little trouble using regular expressions to find instances of not matching a condition, since what I need to search on are lines without non-ASCII characters.

I found this question about "inverse matching", but the answers there are beyond me.

Then, of course, it's another problem to match lines based on their relationship to each other. Can I match these lines when they are one after the other? I'm not even sure that is possible.

Is there a way I can search for all lines with no non-ASCII characters, and then combine them, using LibreOffice, Gedit, or the command line?

Note that the file is thousands of lines long. I'm not sure, but there may also be English-only lines in groups of 3 or 4.

Best Answer

It seems like you can use sed to do this job, even though it doesn't know about the [[:ascii:]] character class. Instead, we can specify all ASCII characters with a range of decimal escape sequences, [\d0-\d127] (the \dNNN escape is a GNU sed extension), as long as we use the C or POSIX locale.

Here's a command that should be reliable:

LC_ALL=C sed -r ':a;N;s|^([\d0-\d127]+)\n([\d0-\d127]+)$|\1 / \2|;ta;P;D' file

Notes

LC_ALL=C Use C locale settings only for this command (otherwise you get an error)
-r Use extended regex to make the command more readable (we need fewer backslashes) (GNU sed also recognises -E with the same meaning).
:a Label - loop starts here
; Separates commands, like in the shell
N Read the next line into the pattern space, so we can replace \n
s|old|new| Replace old with new
^([\d0-\d127]+)\n([\d0-\d127]+)$ - match two lines with only ASCII and capture the first line in \1 and the second line in \2. ^ is start of line, \n is a newline, and $ is end of line, so ^line 1\nline 2$ tests the whole of line 1 and line 2.
\1 / \2 The first and second lines, separated by / instead of a newline.
ta - If the last search-and-replace command succeeded, jump back to the label a. The merged line is then tested against the next line read, which handles any instances where there are more than two all-ASCII lines together.
P Print the pattern space up to the first newline, that is, the first of the two lines being held.
D Delete the pattern space up to the first newline and start the next cycle with what is left, without reading a new line first. Together, P;D move through the file one line at a time, so a pair of all-ASCII lines is found wherever it starts. Without them, sed would test lines 1 and 2, then 3 and 4, and so on, and would miss a pair that starts on an even-numbered line.
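
For comparison, here is a rough sketch of the same merge written in awk rather than sed. It is not the method used above, just another way to express "join each run of all-ASCII lines with / ". The function name merge_ascii_runs is made up for this example, and the octal range \001-\177 stands in for sed's \d escapes. As with the sed pattern, empty lines do not count as all-ASCII lines and so end a run.

```shell
# merge_ascii_runs is a hypothetical helper name, not part of any tool.
merge_ascii_runs() {
  LC_ALL=C awk '
    # Non-empty line made only of bytes 1..127: add it to the current run.
    /^[\001-\177]+$/ {
      buf = (have ? buf " / " : "") $0
      have = 1
      next
    }
    # Any other line ends the run: flush the merged line, then print this one.
    {
      if (have) { print buf; have = 0 }
      print
    }
    # Flush a run that reaches the end of the input.
    END { if (have) print buf }
  ' "$@"
}

printf '%s\n' 'English words only' 'Also English words only' 'English and 日本語' |
  merge_ascii_runs
```

On those three input lines this should print English words only / Also English words only, followed by the unchanged English and 日本語 line. Passing a file name (merge_ascii_runs file) works the same way.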

Many thanks to Eliah Kagan for showing me how to use escape sequences to match ASCII characters.

Related Solutions

Ubuntu – Regular expressions VS Filename globbing

Regular expressions and file name globbing are two very different things.

Regular expressions are used in commands / functions for pattern matching in text. For example in the pattern parameter of grep, or in programming languages.

File name globbing is used by shells for matching file and directory names using wildcards. The capabilities of globbing depend on the shell. Bash, for example, supports wildcards like:

* match 0 or more characters
? match 1 character
[...] match a character in the specified set

These wildcards may look similar to regular expressions, and [...] indeed has essentially the same meaning in globbing and regex. But * and ? mean different things in globbing and regex.

In a comment you wrote:

but how the interpreter difference * whether it's a joker or regex ? for example : grep a*b a*.txt ?

Easy. Sort of.

First of all, the shell tries to interpret the wildcards, by matching them against filenames. If there are files starting with "a" and ending with "b", the shell will replace a*b with the matching filenames. Same goes for a*.txt. If there are no matching filenames, the shell will pass the arguments to grep as they were, literally.

However, the first parameter of grep should be a pattern. In 99.999% of practical use cases you don't want the first parameter to be interpreted by the shell. So most probably the intention was this:

grep "a*b" a*.txt

Thanks to quoting a*b, the shell will not interpret it using globbing, and instead pass it directly to grep. In turn, grep will interpret that as a regular expression (by design).

To sum it up, the shell interprets the command line following its own globbing language, which uses wildcards. Commands and programs then interpret their parameters in whatever way their authors designed.
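
To make the difference concrete, here is a small sketch that tests the same pattern, a*b, once with glob rules and once with regex rules. The helper names glob_match and regex_match are made up for this demo. A case statement is used because it applies the shell's glob matching to a string without needing any files on disk.

```shell
# glob_match and regex_match are hypothetical helper names for this demo.

# Glob semantics (what the shell applies): * stands for any run of
# characters, so a*b needs a leading "a" and a trailing "b".
glob_match() {
  case "$1" in
    a*b) echo yes ;;
    *)   echo no ;;
  esac
}

# Regex semantics (what grep applies): a* means zero or more "a".
# -x makes the pattern cover the whole line, -q suppresses output.
regex_match() {
  if printf '%s\n' "$1" | grep -qx 'a*b'; then echo yes; else echo no; fi
}

for word in axxb aab b; do
  echo "$word glob=$(glob_match "$word") regex=$(regex_match "$word")"
done
```

This prints axxb glob=yes regex=no, then aab glob=yes regex=yes, then b glob=no regex=yes. The same three characters accept different strings depending on who interprets them.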

Ubuntu – ASCII source file checker

If you want to look for non-ASCII characters, perhaps you should invert the search to exclude ASCII characters:

grep -Pn '[^\x00-\x7F]'

For example:

$ curl https://help.ubuntu.com/16.04/installation-guide/amd64/install.en.txt -s | grep -nP '[^\x00-\x7F]' | head
9:Appendix F, GNU General Public License.
14:(codename "‘Xenial Xerus’"), for the 64-bit PC ("amd64") architecture. It also
18:━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
368:  • Ubuntu will always be free of charge, and there is no extra fee for the "
372:  • Ubuntu includes the very best in translations and accessibility
376:  • Ubuntu is shipped in stable and regular release cycles; a new release will
380:  • Ubuntu is entirely committed to the principles of open source software

In lines 9, 330, 337 and 359, Unicode non-breaking space characters are present.

The particular output you get may be due to grep's support for UTF-8. For a Unicode locale, some of those characters may compare equal to a normal ASCII character. Forcing the C locale will show the expected results in that case:

$ LANG=C grep -Pn '[\x80-\xFF]' install.en.txt| head
9:Appendix F, GNU General Public License.
14:(codename "‘Xenial Xerus’"), for the 64-bit PC ("amd64") architecture. It also
18:━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
368:  • Ubuntu will always be free of charge, and there is no extra fee for the "
372:  • Ubuntu includes the very best in translations and accessibility
376:  • Ubuntu is shipped in stable and regular release cycles; a new release will
380:  • Ubuntu is entirely committed to the principles of open source software

$ LANG=en_GB.UTF-8 grep -Pn '[\x80-\xFF]' install.en.txt| head
9:Appendix F, GNU General Public License.
330:when things go wrong. The Installation Howto can be found in Appendix A,
337:Chapter 1. Welcome to Ubuntu
359:1.1. What is Ubuntu?
394:1.1.1. Sponsorship by Canonical
402:1.2. What is Debian?
456:1.2.1. Ubuntu and Debian
461:1.2.1.1. Package selection
475:1.2.1.2. Releases
501:1.2.1.3. Development community
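
If the goal is a pass/fail check for a source file, the grep command above could be wrapped along these lines. This is only a sketch: check_ascii is a made-up name, it assumes a grep built with -P support (as GNU grep on Ubuntu is), and it forces the C locale so that bytes are compared as bytes.

```shell
# check_ascii is a hypothetical helper name, not an existing command.
check_ascii() {
  rc=0
  # Print every line holding a byte above 0x7F, prefixed with its number.
  LC_ALL=C grep -nP '[^\x00-\x7F]' -- "$1" || rc=$?
  case $rc in
    0) return 1 ;;  # grep matched: the file has non-ASCII bytes
    1) return 0 ;;  # no match: the file is pure ASCII
    *) return 2 ;;  # grep itself failed, e.g. unreadable file
  esac
}

# Example call (the file name is a placeholder):
#   check_ascii main.c && echo "ASCII only"
```

The exit status is inverted relative to grep on purpose, so that "no offending lines" counts as success and the function can be used directly in an if or with &&.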
