Bash – sed doesn’t distinguish between full regex match and no match

bash sed

I want to extract the portion of a string that matches a regex. The following code works correctly:

regex="ss"
string="blossom"
echo "$string" | sed "s/^.*\($regex\).*$/\1/"

Output is:

ss

However, if the regex matches nothing, the whole string is returned. For example, with the same string and command but:

regex="aa"

Output:

blossom

This is not what I want. When there is no match, nothing should be printed. How can this be accomplished?

Best Answer

As choroba said, by default sed prints every line, with any substitutions applied, whether or not the pattern matched. You could do what you want with:

regex="ss"
string="blossom"
echo "$string" | sed -n "s/^.*\($regex\).*$/\1/p"

The -n option tells sed not to print lines by default. The p flag at the end of the s command then tells sed to print the line, with replacements applied, only if the substitution matched something.
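As a minimal sketch, the same command can be wrapped in a small function so that the "match" and "no match" cases are easy to compare. The name extract_match is hypothetical and only used for illustration; the sketch also assumes the regex contains no / character, since / is the delimiter of the s command.

```shell
# extract_match is a hypothetical helper name, not part of the original answer.
# Usage: extract_match REGEX STRING
extract_match() {
    # -n suppresses default printing; the p flag prints only after a successful substitution.
    printf '%s\n' "$2" | sed -n "s/^.*\($1\).*$/\1/p"
}

match=$(extract_match "ss" "blossom")   # regex matches
miss=$(extract_match "aa" "blossom")    # regex does not match

printf 'match=[%s]\n' "$match"   # prints: match=[ss]
printf 'miss=[%s]\n' "$miss"     # prints: miss=[]
```

Because the leading ^.* is greedy, this returns the last occurrence of the regex on the line when there are several.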

Related Solutions

FreeBSD / Mac OS X sed: Print regex match and the line 5 lines after the match

Might be easier with awk:

awk '/foo/ {print; p[NR+5]; next}; NR in p'
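To see what the awk one-liner does, here is a small sketch run on made-up input. On a matching line it prints the line and merely references p[NR+5], which in awk is enough to create that array element; on later lines, NR in p is true exactly for the line numbers recorded that way. The sample lines are an assumption chosen so that foo sits on line 3.

```shell
# Sample input is invented for illustration: "foo" is on line 3,
# so line 8 ("g") should be printed as well.
result=$(printf '%s\n' a b foo c d e f g h |
    awk '/foo/ {print; p[NR+5]; next}; NR in p')

printf '%s\n' "$result"
# prints:
# foo
# g
```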

Non-Greedy Match with SED Regex – Emulating Perl’s .*?

Sed regexes always take the longest match. Sed has no equivalent of a non-greedy quantifier.

What we want to do is match

1. AB,
   followed by
2. any amount of anything other than AC,
   followed by
3. AC

Unfortunately, sed can’t do #2 — at least not for a multi-character regular expression. Of course, for a single-character regular expression such as @ (or even [123]), we can do [^@]* or [^123]*. And so we can work around sed’s limitations by changing all occurrences of AC to @ and then searching for

1. AB,
   followed by
2. any number of anything other than @,
   followed by
3. @

like this:

sed 's/AC/@/g; s/AB[^@]*@/XXX/; s/@/AC/g'

The last part changes unmatched instances of @ back to AC.
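As a quick sketch of the difference, the following compares a plain greedy .* with the placeholder work-around on the same sample string. The variable names are only for illustration, and the input is assumed to contain no @ characters.

```shell
input='ssABteAstACABnnACss'

# Greedy: .* runs to the LAST AC on the line.
greedy=$(printf '%s\n' "$input" | sed 's/AB.*AC/XXX/')

# Work-around: stop at the FIRST AC by temporarily turning AC into @.
shortest=$(printf '%s\n' "$input" | sed 's/AC/@/g; s/AB[^@]*@/XXX/; s/@/AC/g')

printf '%s\n' "$greedy"     # prints: ssXXXss
printf '%s\n' "$shortest"   # prints: ssXXXABnnACss
```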

But this approach is risky because the input could already contain @ characters. Those would be treated as if they were AC, so we could get false positives. However, since no shell variable will ever have a NUL (\x00) character in it, NUL is likely a good character to use in the above work-around instead of @:

$ echo 'ssABteAstACABnnACss' | sed 's/AC/\x00/g; s/AB[^\x00]*\x00/XXX/; s/\x00/AC/g'
ssXXXABnnACss

The use of NUL requires GNU sed. (To make sure that GNU features are enabled, the user must not have set the environment variable POSIXLY_CORRECT.)

If you are using sed with GNU's -z flag to handle NUL-separated input, such as the output of find ... -print0, then NUL will not be in the pattern space and NUL is a good choice for the substitution here.

Although NUL cannot be in a bash variable, it is possible to include it in a printf command. If your input string can contain any character at all, including NUL, then see Stéphane Chazelas' answer, which adds a clever escaping method.
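To make the earlier warning about @ concrete, here is a small sketch of the placeholder version run on made-up input that already contains an @. The desired result would be XXX, since the whole string runs from AB to the first AC, but the pre-existing @ is mistaken for the placeholder.

```shell
# Invented input that already contains the placeholder character @.
input='AB@xAC'

result=$(printf '%s\n' "$input" | sed 's/AC/@/g; s/AB[^@]*@/XXX/; s/@/AC/g')

# The match stops at the original @, and the last step then
# turns the real placeholder back into AC.
printf '%s\n' "$result"   # prints: XXXxAC
```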
