I have a text file that looks like this:
English words only
English and 日本語
日本語のみ
English words only
English and 日本語
日本語のみ
English words only
Also English words only
English and 日本語
日本語のみ
English words only
English and 日本語
日本語のみ
Note that in the middle there are two lines, "English words only" and "Also English words only", one right after the other.
What I need to do is take those two lines and combine them into one line separated by a /, like this:
English words only
English and 日本語
日本語のみ
English words only
English and 日本語
日本語のみ
English words only / Also English words only
English and 日本語
日本語のみ
English words only
English and 日本語
日本語のみ
I've found that I can search for lines with ASCII characters with the regular expression [[:ascii:]], and for non-ASCII with [^[:ascii:]]. However, I'm having a little trouble using regular expressions to find instances of not matching a condition, since what I need to search for are lines without non-ASCII characters.
I found this question about "inverse matching", but the answers there are beyond me.
Then, of course, it's another problem to match lines based on their relationship to each other. Can I match these lines when they are one after the other? I'm not even sure that is possible.
Is there a way I can search for all lines with no non-ASCII characters, and then combine them, using LibreOffice, Gedit, or the command line?
Note that the file is thousands of lines long. I am not sure, but there may also be English-only lines in groups of 3 or 4.
Best Answer
It seems like you can use sed to do this job, even though it doesn't know about the [[:ascii:]] character class. Instead of that, we can specify all ASCII characters with a range of escape sequences, [\d0-\d127], as long as we use the C or POSIX locales.
Here's a command that should be reliable:
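Notes
LC_ALL=C - Use C locale settings only for this command (otherwise you get an error).
-r - Use extended regex to make the command more readable (we need fewer backslashes). GNU sed also recognises -E with the same meaning.
:a - Label; the loop starts here.
; - Separates commands, like in the shell.
N - Append the next line to the pattern space, so we can replace the \n between the two lines.
s|old|new| - Replace old with new.
^([\d0-\d127]+)\n([\d0-\d127]+)$ - Match two lines with only ASCII and capture the first line in \1 and the second line in \2. ^ is start of line, \n is a newline, and $ is end of line, so ^line 1\nline 2$ tests the whole of line 1 and line 2.
\1 / \2 - The first and second lines, separated by / instead of a newline.
ta - If the last search-and-replace command succeeded, jump back to the label a and execute the loop again. This allows us to process all the lines of the file, handling any instances where there are more than two all-ASCII lines together.
Many thanks to Eliah Kagan for showing me how to use escape sequences to match ASCII characters.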
- If the last search-and-replace command succeeded, execute the loop again. This allows us to process all the lines of the file, handling any instances where there are more than two all-ASCII lines together.Many thanks to Eliah Kagan for showing me how to use escape sequences to match ASCII characters.