Text Processing with Sed and Grep – Return Only the Portion of a Line After a Matching Pattern

grepsedtext processing

So pulling open a file with cat and then using grep to get matching lines only gets me so far when I am working with the particular log set that I am dealing with. It need a way to match lines to a pattern, but only to return the portion of the line after the match. The portion before and after the match will consistently vary. I have played with using sed or awk, but have not been able to figure out how to filter the line to either delete the part before the match, or just return the part after the match, either will work.
This is an example of a line that I need to filter:

2011-11-07T05:37:43-08:00 <0.4> isi-udb5-ash4-1(id1) /boot/kernel.amd64/kernel: [gmp_info.c:1758](pid 40370="kt: gmp-drive-updat")(tid=100872) new group: <15,1773>: { 1:0-25,27-34,37-38, 2:0-33,35-36, 3:0-35, 4:0-9,11-14,16-32,34-38, 5:0-35, 6:0-15,17-36, 7:0-16,18-36, 8:0-14,16-32,34-36, 9:0-10,12-36, 10-11:0-35, 12:0-5,7-30,32-35, 13-19:0-35, 20:0,2-35, down: 8:15, soft_failed: 1:27, 8:15, stalled: 12:6,31, 20:1 }

The portion I need is everything after "stalled".

The background behind this is that I can find out how often something stalls:

cat messages | grep stalled | wc -l

What I need to do is find out how many times a certain node has stalled (indicated by the portion before each colon after "stalled". If I just grep for that (ie 20:) it may return lines that have soft fails, but no stalls, which doesn't help me. I need to filter only the stalled portion so I can then grep for a specific node out of those that have stalled.

For all intents and purposes, this is a freebsd system with standard GNU core utils, but I cannot install anything extra to assist.

Best Answer

The canonical tool for that would be sed.

sed -n -e 's/^.*stalled: //p'

Detailed explanation:

  • -n means not to print anything by default.
  • -e is followed by a sed command.
  • s is the pattern replacement command.
  • The regular expression ^.*stalled: matches the pattern you're looking for, plus any preceding text (.* meaning any text, with an initial ^ to say that the match begins at the beginning of the line). Note that if stalled: occurs several times on the line, this will match the last occurrence.
  • The match, i.e. everything on the line up to stalled:, is replaced by the empty string (i.e. deleted).
  • The final p means to print the transformed line.

If you want to retain the matching portion, use a backreference: \1 in the replacement part designates what is inside a group \(…\) in the pattern. Here, you could write stalled: again in the replacement part; this feature is useful when the pattern you're looking for is more general than a simple string.

sed -n -e 's/^.*\(stalled: \)/\1/p'

Sometimes you'll want to remove the portion of the line after the match. You can include it in the match by including .*$ at the end of the pattern (any text .* followed by the end of the line $). Unless you put that part in a group that you reference in the replacement text, the end of the line will not be in the output.

As a further illustration of groups and backreferences, this command swaps the part before the match and the part after the match.

sed -n -e 's/^\(.*\)\(stalled: \)\(.*\)$/\3\2\1/p'
Related Question