Text Processing – Robust Way to Edit and Replace Pattern Matched

awk, sed, text processing

Is there a way to edit a matched pattern and then replace another pattern with the edited pattern?

Input:

a11.t
some text here
a06.t
some text here

Output:

a11.t 11
some text here
a06.t 06
some text here

The above example shows the first two digits (matched by first pattern) extracted and placed at the end of the line (second pattern).

In a programming language, I would load the file into a data structure, edit, replace, and write to a new file. But is there a one-line equivalent?

Trial:

sed 's/\(a[0-9][0-9].*\)/& \1/I' stack.fa | sed -e 's#a##g2' -e 's#\.\w##g2'

Trial output:

a11.t 11
some text here
a06.t 06
some text here

The trial works, but is there a more robust way? Further, is there another text processing language this could be done in more easily?

Best Answer

sed is the perfect tool for this task. Note, however, that you almost never need to pipe several sed invocations together, since a single sed script can contain several commands.
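As a minimal sketch of that point, the commands below are a portable stand-in for the duplicate-then-trim idea in the question. They are not the asker's exact commands, which rely on GNU extensions such as the I flag, g2 and \w, and the variable names are only illustrative. The first form pipes two sed processes together; the second hands the same two commands to a single sed with -e. Both should give the same result.

```shell
# Sample input taken from the question.
input='a11.t
some text here
a06.t
some text here'

# Step 1 duplicates a matching line; step 2 trims the copy down to its two digits.
dup='s/^a[0-9][0-9].*$/& &/'
trim='s/ a\([0-9][0-9]\)[^ ]*$/ \1/'

# Two processes joined by a pipe.
piped=$(printf '%s\n' "$input" | sed "$dup" | sed "$trim")

# One process running both commands, in order, on each line.
single=$(printf '%s\n' "$input" | sed -e "$dup" -e "$trim")

printf '%s\n' "$single"
```

Each -e expression is applied to the line left behind by the previous one, which is why the pipe adds nothing here.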

If you wanted to extract the first sequence of 2 decimal digits on a line and, when one is found, append it to the end of the line after a space, you'd do:

sed 's/\([[:digit:]]\{2\}\).*$/& \1/' < your-file

If you wanted to do that only when the digits are in second position on the line, immediately after an a:

sed 's/^a\([[:digit:]]\{2\}\).*$/& \1/' < your-file

And if you don't want to do it if that sequence of 2 digits is followed by more digits:

sed 's/^a\([[:digit:]]\{2\}\)\([^[:digit:]].*\)\{0,1\}$/& \1/' < your-file
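To see how the three commands differ, here is a small sketch that runs each of them over a few edge cases. The sample lines and the function names (first_two_digits, anchored, exactly_two) are made up for illustration; the sed expressions are the ones shown above.

```shell
# Hypothetical edge cases: digits not in second position, and a three-digit run.
sample='a11.t
xa06.t
a123.t
some text here'

# Wrap each of the three commands in a function so they can be compared.
first_two_digits() { sed 's/\([[:digit:]]\{2\}\).*$/& \1/'; }
anchored()         { sed 's/^a\([[:digit:]]\{2\}\).*$/& \1/'; }
exactly_two()      { sed 's/^a\([[:digit:]]\{2\}\)\([^[:digit:]].*\)\{0,1\}$/& \1/'; }

# Run every variant over the same sample and label its output.
for variant in first_two_digits anchored exactly_two; do
  printf '== %s ==\n' "$variant"
  printf '%s\n' "$sample" | "$variant"
done
```

With this sample, only the first variant touches xa06.t, the first two append 12 to a123.t, and the third leaves a123.t alone because the two digits are followed by another digit.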

In terms of robustness, it all boils down to answering the question: what should be matched, and what should not be? That's why it's important to specify your requirements clearly and to understand what the input may look like. For example: can there be digits in the lines where you don't want to find a match? Can there be non-ASCII characters in the input? Is the input encoded in the locale's charset?

Above, depending on the sed implementation, the input will either be decoded into text based on the locale's charmap (see the output of locale charmap), or be interpreted as if each byte corresponded to a character, with bytes 0 to 127 interpreted as per the ASCII charmap (assuming you're not on an EBCDIC-based system).

For sed implementations in the first category, it may not work properly if the file is not encoded in the right charset. For those in the second category, it could fail if there are characters in the input whose encoding contains the encoding of decimal digits.
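If the encoding of the file is in doubt, one option is to check the locale's charmap and, where byte-wise matching is acceptable, run sed in the C locale. The sketch below assumes that your sed honours LC_ALL=C by treating each byte as a character, which is typical but not guaranteed; append_digits_bytewise is a hypothetical helper name.

```shell
# Show which charmap the current locale uses to decode input
# (guarded in case the locale utility is not installed).
locale charmap 2>/dev/null || echo 'locale utility not available'

# Hypothetical helper: run the anchored command in the C locale, so that
# sed is expected to match byte by byte instead of decoding characters.
append_digits_bytewise() {
  LC_ALL=C sed 's/^a\([[:digit:]]\{2\}\).*$/& \1/'
}

printf 'a11.t\nsome text here\n' | append_digits_bytewise
```

This should be safe for UTF-8 input, since no byte of a multi-byte UTF-8 character falls in the ASCII range. It is subject to the caveat above for charsets such as GB18030, where the encoding of some characters does contain the bytes used for decimal digits.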
