$1 not working with sed

regular expression, sed, xml

I have a bunch of files that contain XML tags like:

<h> PIDAT <h> O

I need to delete everything that comes after the first <h> in that line, so I can get this:

<h>

For that I'm using

sed -i -e 's/(^<.*?>).+/$1/' *.conll

But it seems that sed is not recognizing the $1. (As I understand it, replacing the match with $1 should delete everything that is not contained in the group.) Is there a way I can achieve this? I'd really appreciate it if you could point me in the right direction.

PS: I tested those expressions in a regex app and they worked, but they do not work from the command line.

Best Answer

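sed backreferences have the form \1, \2, etc.; $1 is Perl syntax, and sed treats it as literal text in the replacement. Also, if using basic regular expressions (BRE, sed's default), you need to escape the parentheses forming a group as \(...\), and likewise write \? and \+ (these two are GNU extensions in BRE). Or you can use extended regular expressions (ERE) with the -E option, where (...), ? and + work unescaped.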
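To make the difference concrete, here is a small sketch that runs the three variants on the sample line from the question. It reads from a pipe instead of editing files, and the variable names are only for illustration. It assumes a sed that supports -E (GNU and BSD sed both do).

```shell
# Sample line from the question.
line='<h> PIDAT <h> O'

# Perl-style $1: sed does not expand it, so the replacement is the literal text "$1".
perl_style=$(printf '%s\n' "$line" | sed -E 's/^(<h>).*/$1/')

# BRE (sed's default): the group is written \( ... \), the backreference is \1.
bre_out=$(printf '%s\n' "$line" | sed 's/^\(<h>\).*/\1/')

# ERE (-E): the group is written ( ... ), the backreference is still \1.
ere_out=$(printf '%s\n' "$line" | sed -E 's/^(<h>).*/\1/')

printf '%s\n' "$perl_style" "$bre_out" "$ere_out"
```

The first result is the two characters $1; the other two are the captured tag.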

Note that sed regexes are greedy, so <.*> will match <h> PIDAT <h> in that line, instead of stopping at the first >. sed also has no Perl-style non-greedy quantifier, so .*? does not mean "match as little as possible" here. At best (with -E) the ? is read as "optional", which is redundant because .* can already match nothing.

This might work:

sed -i -Ee 's/^(<[^>]*>).*/\1/' *.conll

[^>] matches everything except >, so <[^>]*> will match <h> but not <h> PIDAT <h>.
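A quick way to see the difference is to run both patterns on the sample line without -i, so nothing on disk is touched. This is only a sketch; the variable names are illustrative.

```shell
# Sample line from the question.
line='<h> PIDAT <h> O'

# Greedy: <.*> extends to the LAST > on the line, so the group holds too much.
greedy=$(printf '%s\n' "$line" | sed -E 's/^(<.*>).*/\1/')

# Negated bracket expression: [^>]* cannot cross a >, so the group ends at the first tag.
first_tag=$(printf '%s\n' "$line" | sed -E 's/^(<[^>]*>).*/\1/')

printf 'greedy:    %s\n' "$greedy"
printf 'first tag: %s\n' "$first_tag"
```

Once the output looks right on a sample, the same expression can be applied to the real files with -i.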

Related Solutions

Linux – Why isn’t sed using the extended regex mode by default

Re 1) The answer is the same as for any other tool that was improved over decades. :)

You don't want to break existing scripts by changing default behaviour.

Re 2) That has nothing to do with the matching engine; it's just a question of which set of regular expressions is supported. POSIX BRE means "basic regular expression".

Sed XML – Remove Nodes with Namespace via Command Line

Sure, this is a task for xmlstarlet (a proper XML parser) and its companion XPath, like this:

xmlstarlet ed -L \
              -N w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" \
              -d '//w:rPr' file.xml

A brief explanation:

-L edit the file on the fly like sed -i
-N set the XML namespace, if needed
-d remove nodes matching xpath expression

Check xmlstarlet edit --help

TL;DR

Please, never ever use sed for this task!

Every time you use sed for HTML or XML, you kill a kitty.

Theory:

According to compiler theory, XML/HTML can't be parsed with regexes based on a finite-state machine. Because XML/HTML is hierarchical, you need a pushdown automaton and an LALR grammar handled by a tool like YACC.

You can use one of the following:

xmllint often installed by default with libxml2, xpath1

xmlstarlet can edit, select, transform... Not installed by default, xpath1

xpath installed via perl's module XML::XPath, xpath1

xidel xpath3

saxon-lint my own project, wrapper over @Michael Kay's Saxon-HE Java library, xpath3

Or you can use a high-level language with a proper library, such as:

python's lxml (from lxml import etree)

perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath

ruby nokogiri, check this example

php DOMXpath, check this example

Check: Using regular expressions with HTML tags