Re 1) The answer is the same as for any other tool that was improved over decades. :)
You don't want to break existing scripts by changing default behaviour.
Re 2) That has nothing to do with the matching engine; it's just a question of which set of regular expressions is supported. POSIX BRE means "basic regular expression".
Sure, this is a job for xmlstarlet (a proper XML parser) and its friend XPath, like this:
xmlstarlet ed -L \
-N w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" \
-d '//w:rPr' file.xml
A few explanations:
-L: edit the file in place, like sed -i
-N: declare the XML namespace, if needed
-d: delete the nodes matching the XPath expression
See xmlstarlet ed --help for details.
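Before editing in place, it can help to try the same XPath on a throwaway copy. The following is only a sketch: the sample document and the temporary file are made up for illustration, and the guard assumes the binary is called xmlstarlet (some distributions ship it under another name).

```shell
# Hypothetical sample document, written to a temporary file for this sketch.
ns='http://schemas.openxmlformats.org/wordprocessingml/2006/main'
tmp=$(mktemp)
cat > "$tmp" <<EOF
<w:document xmlns:w="$ns"><w:body><w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Hello</w:t></w:r></w:p></w:body></w:document>
EOF

if command -v xmlstarlet >/dev/null 2>&1; then
  # Count the nodes the XPath matches before touching the file.
  before=$(xmlstarlet sel -N w="$ns" -t -v 'count(//w:rPr)' "$tmp")
  # Same edit as above: delete every w:rPr node in place.
  xmlstarlet ed -L -N w="$ns" -d '//w:rPr' "$tmp"
  # Count again to confirm the nodes are gone.
  after=$(xmlstarlet sel -N w="$ns" -t -v 'count(//w:rPr)' "$tmp")
  echo "w:rPr nodes: $before -> $after"
else
  echo "xmlstarlet not installed, skipping"
fi
rm -f "$tmp"
```

Counting with sel first is a cheap way to check that the namespace prefix and the XPath are right before -L rewrites the real file.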
TL;DR
Please, never ever use sed for this task!
Every time you use sed for HTML or XML, you kill a kitty.
Theory:
According to compiler theory, XML/HTML can't be parsed with regexes based on a finite-state machine. Because XML/HTML is hierarchical, you need a pushdown automaton and an LALR grammar, handled with a tool like YACC.
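A small illustration of the point above, using sed on two made-up one-line documents (the element names are hypothetical). It is a sketch of the failure mode, not a proof: the regex has no notion of nesting depth, so it either grabs too much or pairs the wrong tags.

```shell
# Hypothetical sample data, made up for this illustration.
siblings='<r><b>1</b><keep>2</keep><b>3</b></r>'
nested='<r><b>1<b>2</b>3</b></r>'

# Greedy match: runs from the first <b> to the last </b>,
# so the <keep> element in between is deleted as well.
greedy_out=$(printf '%s\n' "$siblings" | sed 's/<b>.*<\/b>//')

# Forbidding '<' inside the match: only the innermost <b> is removed,
# because the pattern cannot pair the outer <b> with its own closing tag.
inner_out=$(printf '%s\n' "$nested" | sed 's/<b>[^<]*<\/b>//g')

printf '%s\n' "$greedy_out" "$inner_out"
```

The first command prints `<r></r>` and the second prints `<r><b>13</b></r>`; neither is "delete each b element", which is what an XPath such as //b expresses directly.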
realLife©®™ everyday tools in a shell:
You can use one of the following:
xmllint: often installed by default with libxml2, xpath1
xmlstarlet: can edit, select, transform... Not installed by default, xpath1
xpath: installed via Perl's module XML::XPath, xpath1
xidel: xpath3
saxon-lint: my own project, a wrapper over @Michael Kay's Saxon-HE Java library, xpath3
Or you can use a high-level language with a proper library. I am thinking of:
Python's lxml (from lxml import etree)
Perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath
Ruby's nokogiri, check this example
PHP's DOMXpath, check this example
Check: Using regular expressions with HTML tags
sed backreferences have the form \1, \2, etc.; $1 is more Perl-like. Also, if you use basic regular expressions (BRE), you need to escape the parentheses (...) forming a group (that is, write \(...\)), as well as ? and +. Or you can use extended regular expressions with the -E option.
Note that sed regexes are greedy, so <.*> will match <h> PIDAT <h> in that line, instead of stopping at the first >. And .*? does not make sense (.* can already match nothing, so making it optional with ? is unnecessary).
This might work:
[^>] matches everything except >, so <[^>]*> will match <h> but not <h> PIDAT <h>.
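To see these points side by side, here is a minimal sketch run against the sample line from the question. The replacement strings (X and the square brackets) are arbitrary choices for the demonstration, not part of the original command.

```shell
line='<h> PIDAT <h>'

# Greedy: <.*> runs to the last '>' on the line, so the whole line is replaced.
greedy=$(printf '%s\n' "$line" | sed 's/<.*>/X/')

# Negated class: <[^>]*> stops at the first '>', so each tag is replaced on its own.
tags=$(printf '%s\n' "$line" | sed 's/<[^>]*>/X/g')

# BRE: the group needs escaped parentheses; the backreference is \1, not $1.
bre=$(printf '%s\n' "$line" | sed 's/<\([^>]*\)>/[\1]/g')

# ERE (-E): plain parentheses form the group; the backreference is still \1.
ere=$(printf '%s\n' "$line" | sed -E 's/<([^>]*)>/[\1]/g')

printf '%s\n' "$greedy" "$tags" "$bre" "$ere"
```

This prints X, then X PIDAT X, then [h] PIDAT [h] twice: the BRE and ERE forms are equivalent here and differ only in how the group is written.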