Text Processing – Robust Way to Edit and Replace Pattern Matched

awk, sed, text processing

Is there a way to edit a matched pattern and then replace another pattern with the edited pattern?

Input:

a11.t
some text here
a06.t
some text here

Output:

a11.t 11
some text here
a06.t 06
some text here

The above example shows the first two digits (matched by first pattern) extracted and placed at the end of the line (second pattern).

In a programming language, I would load the file into a data structure, edit, replace, and write to a new file. But is there a one-line equivalent?

Trial:

sed 's/\(a[0-9][0-9].*\)/& \1/I' stack.fa | sed -e 's#a##g2' -e 's#\.\w##g2'

Trial output:

a11.t 11
some text here
a06.t 06
some text here

Obviously the trial works, but is there a more robust way? Further, is there another text processing language this could be done in more easily?

Best Answer

sed here is the perfect tool for the task. However note that you almost never need to pipe several sed invocations together as a sed script can be made of several commands.
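As a sketch of that point, the two piped sed invocations from the question can be folded into a single call with several -e expressions. This assumes GNU sed, since the trial relies on GNU extensions (the I flag, a number combined with g, and \w); the sample input is the one from the question, and the variable names are only for illustration.

```shell
# Sample input from the question.
input='a11.t
some text here
a06.t
some text here'

# Original form: two sed processes connected by a pipe.
piped=$(printf '%s\n' "$input" |
  sed 's/\(a[0-9][0-9].*\)/& \1/I' |
  sed -e 's#a##g2' -e 's#\.\w##g2')

# Same three commands, run in order by a single sed process.
single=$(printf '%s\n' "$input" |
  sed -e 's/\(a[0-9][0-9].*\)/& \1/I' -e 's#a##g2' -e 's#\.\w##g2')

printf '%s\n' "$single"
```

Both forms should give the same result here, because each command only looks at the current line.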

If you wanted to extract the first sequence of 2 decimal digits and, if found, append it after a space to the end of the line, you'd do:

sed 's/\([[:digit:]]\{2\}\).*$/& \1/' < your-file

If you wanted to do that only when the 2 digits are in second position on the line, right after an a:

sed 's/^a\([[:digit:]]\{2\}\).*$/& \1/' < your-file

And if you don't want to do it if that sequence of 2 digits is followed by more digits:

sed 's/^a\([[:digit:]]\{2\}\)\([^[:digit:]].*\)\{0,1\}$/& \1/' < your-file
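To see how the three commands differ, they can be run side by side on a small input. The sample lines below are hypothetical, chosen to include digits in free text and a line with three digits after the a; the variable names are only for illustration.

```shell
# Hypothetical sample: a normal line, digits in free text, three digits after "a".
input='a11.t
some text 42 here
a123.t'

# 1. First run of two digits anywhere on the line.
anywhere=$(printf '%s\n' "$input" | sed 's/\([[:digit:]]\{2\}\).*$/& \1/')

# 2. Only when the two digits directly follow a leading "a".
anchored=$(printf '%s\n' "$input" | sed 's/^a\([[:digit:]]\{2\}\).*$/& \1/')

# 3. As 2, but not when a third digit follows.
strict=$(printf '%s\n' "$input" |
  sed 's/^a\([[:digit:]]\{2\}\)\([^[:digit:]].*\)\{0,1\}$/& \1/')

# Print the three results, each followed by a separator line.
printf '%s\n---\n' "$anywhere" "$anchored" "$strict"
```

With this input, the first command also appends 42 to the free-text line and 12 to a123.t, the second leaves the free-text line alone but still appends 12 to a123.t, and only the third leaves a123.t untouched.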

In terms of robustness, it all boils down to answering two questions: what should be matched, and what should not be? That's why it's important to specify your requirements clearly and to understand what the input may look like: can there be digits in lines where you don't want to find a match? Can there be non-ASCII characters in the input? Is the input encoded in the locale's charset?

Above, depending on the sed implementation, the input will either be decoded into text based on the locale's charmap (see the output of locale charmap), or be interpreted as if each byte corresponded to a character, with bytes 0 to 127 interpreted as per the ASCII charmap (assuming you're not on an EBCDIC-based system).
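A minimal sketch of how to check and pin this down: locale charmap reports the charmap in effect, and setting LC_ALL=C for a single sed call is one common way to get byte-wise handling regardless of the user's locale. Whether the C locale is appropriate depends on the data, so treat that choice as an assumption.

```shell
# Show which charmap the current locale uses, if the locale utility is available.
if command -v locale >/dev/null 2>&1; then
  locale charmap
fi

# Force the C locale for this one sed call so the input is handled byte by byte.
result=$(printf 'a11.t\nsome text here\n' |
  LC_ALL=C sed 's/^a\([[:digit:]]\{2\}\).*$/& \1/')

printf '%s\n' "$result"
```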

For sed implementations in the first category, it may not work properly if the file is not encoded in the right charset. For those in the second category, it could fail if there are characters in the input whose encoding contains the encoding of decimal digits.
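The question also asks whether another text processing language would do. As a sketch only, the anchored variant (two digits right after a leading a) could be written in awk with the standard match() and substr() functions. The [0-9] range is used instead of [[:digit:]] on the assumption that the input is plain ASCII.

```shell
# Sample input from the question.
input='a11.t
some text here
a06.t
some text here'

# match() succeeds when the line starts with "a" plus two digits;
# substr() then copies the two digits after the "a" to the end of the line.
# The trailing 1 prints every line, modified or not.
result=$(printf '%s\n' "$input" |
  awk 'match($0, /^a[0-9][0-9]/) { $0 = $0 " " substr($0, RSTART + 1, 2) } 1')

printf '%s\n' "$result"
```

This is longer than the sed command, which fits the point that sed is well suited to this particular job.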

Related Solutions

How to edit the entire file after match a grep pattern

Explanation:

sed 's/^ip/\nip/' file : add an extra newline (\n) to each line beginning with ip. I think this might not work with all implementations of sed, so if yours doesn't support this, replace the sed command with perl -pe 's/^ip/\nip/'. We need this in order to use Perl's "paragraph mode" (seen below).
perl -00pe : the -00 makes perl run in "paragraph mode" where a "line" is defined by two consecutive newlines. This enables us to treat each host's block as a single "line". The -pe means "print each line after applying the script given by -e to it".
if(/\nhost=c\n/){s/ip=\S+/ip=1.2.3.4/} : if this "line" (section) matches a newline followed by the string host=c and then another newline, then replace ip= and 1 or more non-whitespace characters (\S+) following it with ip=1.2.3.4.
s/\n\n/\n/ : replace each pair of newlines with a single newline to get the original file's format back.

If you want this to change the file in place, you can use:

tmp=$(mktemp) && sed 's/^ip/\nip/' file > "$tmp" &&
perl -00pe 'if(/\nhost=c\n/){s/ip=\S+/ip=1.2.3.4/} s/\n\n/\n/' "$tmp" > file
rm -f "$tmp"

Shell – Replace nth Line from the Matched Pattern

Following your approach,

tac file|sed '/juice/{n;n;s/.*/coconut/}'|tac

/juice/ matches a line with juice.
n;n; prints the current line and then the next one, leaving the second line after the match in the pattern space.
s/.*/coconut/ makes the substitution.
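A small run makes the reversal trick easier to follow. The input below is hypothetical, and the sketch assumes GNU tac and GNU sed, as in the command above.

```shell
# Hypothetical input; the goal is to replace the line two above "juice".
input='apple
banana
cherry
juice'

# Reverse the lines, replace the second line after the match, reverse back.
result=$(printf '%s\n' "$input" |
  tac |
  sed '/juice/{n;n;s/.*/coconut/}' |
  tac)

printf '%s\n' "$result"
```

In the reversed stream the line two below juice is banana, so after reversing back it is the line two above juice that reads coconut.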

Apparently you have GNU sed, so you could also use -z to get the whole file into memory and directly edit the line two above juice:

sed -rz 's/[^\n]*(\n[^\n]*\n[^\n]*juice)/coconut\1/' file

[^\n] means "not a newline", and the parentheses () capture the group reproduced by the \1 back-reference.
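As a sketch, here is the same command applied to a hypothetical four-line input. It assumes GNU sed (for -r, -z and \n inside a bracket expression) and input without NUL bytes, so that -z makes the whole input one record.

```shell
# Hypothetical input; the goal is to replace the line two above "juice".
input='apple
banana
cherry
juice'

# With -z the whole input is one record, so the regex can span lines:
# it replaces the line that is followed by one full line and then the juice line.
result=$(printf '%s\n' "$input" |
  sed -rz 's/[^\n]*(\n[^\n]*\n[^\n]*juice)/coconut\1/')

printf '%s\n' "$result"
```

Without the g flag only the first match in the file is replaced.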
