$1 not working with sed

regular expression, sed, xml

I have a bunch of files that contain XML tags like:

<h> PIDAT <h> O

I need to delete everything that comes after the first <h> on that line, so that I get this:

<h>

For that I'm using

sed -i -e 's/(^<.*?>).+/$1/' *.conll

But it seems that sed is not recognizing the $1. (As I understand it, $1 should delete everything that is not contained in the group.) Is there a way I can achieve this? I'd really appreciate it if you could point me in the right direction.

PS: I tested these expressions in a regex app and they worked, but they do not work from the command line.

Best Answer

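sed backreferences have the form \1, \2, etc.; $1 is Perl syntax, and sed inserts it into the replacement literally. Also, with basic regular expressions (BRE), which sed uses by default, you need to escape the parentheses that form a group, \(...\), as well as ? and + (GNU sed accepts \? and \+). Or you can use extended regular expressions (ERE) with the -E option, where none of these need escaping.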
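As a quick illustration of the point above, the following sketch runs the same substitution in BRE and ERE form on the sample line from the question, and then shows what happens when $1 is used in the replacement. The variable names (line, bre, ere, literal) are only for this demonstration, and the pattern is deliberately simplified to match a literal <h>.

```shell
line='<h> PIDAT <h> O'

# BRE (sed's default): the group parentheses must be escaped.
bre=$(printf '%s\n' "$line" | sed 's/^\(<h>\).*/\1/')

# ERE (-E): parentheses form a group without escaping.
ere=$(printf '%s\n' "$line" | sed -E 's/^(<h>).*/\1/')

# $1 is not a backreference in sed, so it ends up in the output as-is.
literal=$(printf '%s\n' "$line" | sed -E 's/^(<h>).*/$1/')

printf '%s\n' "$bre" "$ere" "$literal"
# prints:
# <h>
# <h>
# $1
```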

Note that sed regexes are greedy, so <.*> will match <h> PIDAT <h> in that line instead of stopping at the first >. sed also has no Perl-style lazy quantifier: in .*? the ? would at best make .* optional, which is pointless because .* can already match nothing.
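To see the greedy behaviour on the sample line, here is a small sketch (the variable names line and greedy are just for this demo). The group captures everything up to the last > on the line, not the first.

```shell
line='<h> PIDAT <h> O'

# .* inside the group is greedy, so the group extends to the LAST '>'.
greedy=$(printf '%s\n' "$line" | sed -E 's/^(<.*>).*/\1/')

printf '%s\n' "$greedy"
# prints: <h> PIDAT <h>
```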

This might work:

sed -i -Ee 's/^(<[^>]*>).*/\1/' *.conll

[^>] matches any single character except >, so <[^>]*> will match <h> but not <h> PIDAT <h>.
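Since -i rewrites files in place, it may be worth trying the command on a throwaway copy first. The sketch below assumes a temporary directory and a made-up file name (sample.conll) with two made-up lines; it uses -i.bak, which keeps the original next to the edited file as sample.conll.bak.

```shell
# Hypothetical scratch directory and sample file, only for this test run.
tmpdir=$(mktemp -d)
printf '%s\n' '<h> PIDAT <h> O' '<s> foo <s> B' > "$tmpdir/sample.conll"

# Edit in place, keeping a backup copy with the .bak suffix.
sed -i.bak -E -e 's/^(<[^>]*>).*/\1/' "$tmpdir"/*.conll

result=$(cat "$tmpdir/sample.conll")
printf '%s\n' "$result"
# prints:
# <h>
# <s>

rm -rf "$tmpdir"
```

Once the output looks right, the same expression can be run against the real *.conll files.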
