Extracting part of lines with specific pattern using awk or sed

awk sed text-processing

I have a question about awk/sed. I have a big file in which the following set of lines is repeated:

Expression loweWallrhoPhi :  sum=-6.97168e-09
Expression leftWallrhoPhi :  sum=6.97168e-09
Expression lowerWallPhi :  sum=-5.12623e-12
Expression leftWallPhi :  sum=5.12623e-12
Expression loweWallrhoUSf :  sum=-6.936e-09
Expression leftWallrhoUSf :  sum=6.97169e-09
Expression lowerWallUSf :  sum=-5.1e-12
Expression leftWallUSf :  sum=5.12624e-12

I want to extract the value after the keyword sum in each case into a separate file. Is it possible to do this in one go?

Best Answer

With `grep`:

grep -oP 'sum=\K.*' inputfile > outputfile

GNU grep's -P (Perl-compatible regex) option supports \K, which discards everything matched so far. Combined with -o, only the text after sum= is printed.

With `awk`:

awk -F"=" '{ print $NF; }' inputfile > outputfile

In awk, the variable NF holds the number of fields in the current record (line). That number is also the index of the last field, so $NF is the value of the last field.
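A minimal sketch of how NF and $NF behave, assuming a single sample line in the same shape as the question's input (the variable names line and nf_demo are only for illustration):

```shell
# Sample line in the same shape as the question's input.
line='Expression leftWallPhi :  sum=5.12623e-12'

# With "=" as the separator the line splits into two fields,
# so NF is 2 and $NF is the text after the "=".
nf_demo=$(printf '%s\n' "$line" | awk -F'=' '{ print NF ": " $NF }')
printf '%s\n' "$nf_demo"
```

If a line contained more than one "=", $NF would still pick the text after the last one.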

With `sed`:

sed 's/^.*sum=//' inputfile > outputfile

The pattern ^.*sum= matches everything from the start of the line (^) through the last sum= on that line, and the substitution replaces it with nothing, leaving only the value.

Result:

-6.97168e-09
6.97168e-09
-5.12623e-12
5.12623e-12
-6.936e-09
6.97169e-09
-5.1e-12
5.12624e-12
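The sed command prints every input line, so a line without sum= would pass through unchanged. If the big file also contains such lines (an assumption, the question only shows the Expression lines), a variant with -n and the p flag keeps only the lines where the substitution happened. The sample text and the variable names below are hypothetical:

```shell
# Hypothetical input: two lines from the question plus one unrelated line.
sample='Expression leftWallPhi :  sum=5.12623e-12
Time = 0.5
Expression lowerWallUSf :  sum=-5.1e-12'

# -n suppresses automatic printing; the p flag prints only lines
# where the substitution actually took place.
sums=$(printf '%s\n' "$sample" | sed -n 's/^.*sum=//p')
printf '%s\n' "$sums"
```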

With `cut`:

cut -d'=' -f2 inputfile > outputfile

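If you want each distinct value saved to its own file, with repeated values collected in the same file, awk can do that. Each file is named after the value it holds. Nothing is written to standard output, so no final redirection is needed:

awk -F"=" '{ print $NF > ($NF) }' inputfile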
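A small sketch of what the per-value redirection produces, run in a scratch directory. The three input lines are a hypothetical subset in which one value repeats, and workdir, file_count and dup_lines are illustrative names:

```shell
# Work in a scratch directory so the per-value files are easy to inspect.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input: the value 6.97168e-09 appears twice.
printf '%s\n' \
  'Expression leftWallrhoPhi :  sum=6.97168e-09' \
  'Expression leftWallPhi :  sum=5.12623e-12' \
  'Expression leftWallrhoPhi :  sum=6.97168e-09' > inputfile

# Each value goes to a file named after that value; awk keeps the file
# open, so a repeated value adds a second line to the same file.
awk -F'=' '{ print $NF > ($NF) }' inputfile

file_count=$(ls | wc -l | tr -d ' ')
dup_lines=$(wc -l < 6.97168e-09 | tr -d ' ')
echo "files: $file_count, lines in 6.97168e-09: $dup_lines"
```

The directory ends up with inputfile plus one file per distinct value. Negative values produce file names that start with a dash, which some tools treat as options, so those are easier to handle with a ./ prefix.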

Examples:

Modifying FS:

awk -F" +|;|=" '

$3 == "gene" {
    printf("%s\t%s\t%s\t%s\t%s\t%s\t\n",
    $1, $4, $5, $10, $6, $7);
}
' data.file

Using split:

awk '
$3 == "gene" {
    split($9, a, ";")
    printf("%s\t%s\t%s\t%s\t%s\t%s\t\n",
    $1, $4, $5, substr(a[1], 3), $6, $7);
}
' data.file

OFS and FS:

Set the output field separator (OFS) to a tab and assign FS inside the awk program instead of with -F. FS is also extended so that tabs count as separators:

awk '
BEGIN {
    FS="[ \t]+|;|="
    OFS="\t"
}
$3 == "gene" {
    print $1, $4, $5, $10, $6, $7
}

' data.file
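To see how a regex FS splits a record, here is a sketch on one made-up line. The layout of data.file is not shown in the answer, so the record below is purely an assumption chosen so that the ID value lands in field 10:

```shell
# Hypothetical record; the real layout of data.file is not shown.
record='chr1 src gene 100 200 . + . ID=gene1;Name=demo'

fields=$(printf '%s\n' "$record" | awk '
BEGIN {
    FS = "[ \t]+|;|="   # split on blank runs, semicolons and equals signs
    OFS = "\t"          # join printed fields with a tab
}
$3 == "gene" {
    print $1, $4, $5, $10, $6, $7
}')
printf '%s\n' "$fields"
```

Because "=" and ";" are separators too, ID becomes field 9 and gene1 field 10, which is why the program prints $10 rather than $9.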

Also see the Open Group awk specification, in particular the "Variables and Special Variables" and "Examples" sections.

The gawk manual is also useful; it usually notes when a feature is a gawk extension to awk.

Sed/awk replace a specific pattern under another pattern

Sed can handle this quite easily. It's a single "substitute" command, prefixed with an address range. I've added extra spacing for better readability:

sed -e '/^\[ABC\]$/ , /^\[.*\]$/     s/^\(value1=\).*$/\1notbla/'

Without the extra spacing, it's:

sed -e '/^\[ABC\]$/,/^\[.*\]$/s/^\(value1=\).*$/\1notbla/'

You don't really need anchored regexes, but they may be safer in some cases of unusual inputs. A slightly shorter version with unanchored regexes is:

sed -e '/\[ABC\]/,/^\[/s/^\(value1=\).*$/\1notbla/'

Explanation:

You asked for each flag or option to be explained, and I've got the time, so here you go. I'm explaining the final (shortest) version out of the three Sed commands listed above.

The first part of the line is an address range: /startregex/,/stopregex/. The substitute command that follows the address range is applied only to lines from startregex to stopregex (inclusive).

In this case the start regex is /\[ABC\]/. Square brackets are usually special characters within a regex, so we put a backslash before each to signify literal square bracket characters.

The stop regex is /^\[/, which uses the special regex character ^ to signify the start of a line. This pattern will match any line that starts with a literal left square bracket ([).

The substitute command is basically quite simple; the general format is s/findregex/replacetext/. It can also have special flags placed after the final / to modify its behavior, but I'm not using any such flags here.

The "find regex" is ^\(value1=\).*$.

The caret (^) matches the start of the line, as mentioned earlier, and the dollar sign ($) matches the end of the line. So this whole pattern must match an entire line, not merely part of one.

The parentheses (()), unlike square brackets, are not special by default in Sed's basic regexes, so we put backslashes before them to give them their special meaning. They allow parts of the matched text (the text matched by the "find regex") to be used in the replacement text. Specifically, the \1 in the replacement text means, "The text matched within the first set of parentheses in the regex." In this case, that is always just "value1=".

The final element in the "find regex" is .*. The dot (.) means "any single character," and the asterisk (*) means "any number of times (zero or more)." So the dot star (.*) matches the entire rest of the line, after the equals sign.

"notbla" in the replacement text is just static text, nothing special about it.

To really learn Sed properly, I highly recommend the Grymoire Sed tutorial, which is free online.
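As a check on the explanation above, here is a sketch that runs the shortest command on a made-up INI-style input. The original question's file is not shown here, so the section names and values are assumptions:

```shell
# Hypothetical INI-style input with value1 in three sections.
config='[XYZ]
value1=bla
[ABC]
value1=bla
value2=foo
[DEF]
value1=bla'

# Only lines from [ABC] up to the next line starting with "[" are edited.
edited=$(printf '%s\n' "$config" | sed -e '/\[ABC\]/,/^\[/s/^\(value1=\).*$/\1notbla/')
printf '%s\n' "$edited"
```

Only the value1 line under [ABC] changes. The range ends at [DEF], so the value1 lines under [XYZ] and [DEF] are left alone.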
