Sed XML – Remove Nodes with Namespace via Command Line


I have an XML file that contains the tag </w:rPr> several times. It is used like this:

<w:rPr><w:rFonts w:ascii="Symbol" w:hAnsi="Symbol" w:hint="default"/></w:rPr>

However, the content between the tags is sometimes different. Is there a way to use sed or some other tool to delete everything between <w:rPr> and </w:rPr>, along with both tags themselves?

The relevant namespace


And part of the file itself (formatted, valid XML)

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:numbering xmlns:wpc="" xmlns:cx="" xmlns:cx1="" xmlns:cx2="" xmlns:cx3="" xmlns:cx4="" xmlns:cx5="" xmlns:cx6="" xmlns:cx7="" xmlns:cx8="" xmlns:mc="" xmlns:aink="" xmlns:am3d="" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="" xmlns:m="" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="" xmlns:wp="" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="" xmlns:w14="" xmlns:w15="" xmlns:w16cid="" xmlns:w16se="" xmlns:wpg="" xmlns:wpi="" xmlns:wne="" xmlns:wps="" mc:Ignorable="w14 w15 w16se w16cid wp14">
  <w:abstractNum w:abstractNumId="0" w15:restartNumberingAfterBreak="0">
    <w:nsid w:val="FFFFFF89"/>
    <w:multiLevelType w:val="singleLevel"/>
    <w:tmpl w:val="CB2CEC0E"/>
    <w:lvl w:ilvl="0">
      <w:start w:val="1"/>
      <w:numFmt w:val="bullet"/>
      <w:pStyle w:val="Aufzhlungszeichen"/>
      <w:lvlText w:val="ï‚·"/>
      <w:lvlJc w:val="left"/>
          <w:tab w:val="num" w:pos="360"/>
        <w:ind w:left="360" w:hanging="360"/>
        <w:rFonts w:ascii="Symbol" w:hAnsi="Symbol" w:hint="default"/>

  <!-- ... -->

 <w:abstractNum w:abstractNumId="16" w15:restartNumberingAfterBreak="0">
    <w:nsid w:val="6F8046F9"/>
    <w:multiLevelType w:val="hybridMultilevel"/>
    <w:tmpl w:val="1F3A6CE4"/>
    <w:lvl w:ilvl="0" w:tplc="DE32BBA8">
      <w:start w:val="1"/>
      <w:numFmt w:val="lowerLetter"/>
      <w:lvlText w:val="%1)"/>
      <w:lvlJc w:val="left"/>
        <w:ind w:left="682" w:hanging="567"/>
        <w:rFonts w:ascii="Arial" w:eastAsia="Arial" w:hAnsi="Arial" w:cs="Arial" w:hint="default"/>
        <w:spacing w:val="-1"/>
        <w:w w:val="100"/>
        <w:sz w:val="22"/>
        <w:szCs w:val="22"/>
        <w:lang w:val="de-DE" w:eastAsia="de-DE" w:bidi="de-DE"/>

    <!-- ... -->

    <w:lvl w:ilvl="8" w:tplc="E4341C34">
      <w:numFmt w:val="bullet"/>
      <w:lvlText w:val="•"/>
      <w:lvlJc w:val="left"/>
        <w:ind w:left="7581" w:hanging="567"/>
        <w:rFonts w:hint="default"/>
        <w:lang w:val="de-DE" w:eastAsia="de-DE" w:bidi="de-DE"/>

  <!-- ... -->

  <w:num w:numId="1">
    <w:abstractNumId w:val="15"/>
  <w:num w:numId="2">
    <w:abstractNumId w:val="6"/>

  <!-- ... -->


Best Answer

Sure, this is a job for xmlstarlet (a proper XML parser) and its friend XPath, like this:

xmlstarlet ed -L \
              -N w="" \
              -d '//w:rPr' file.xml

A brief explanation:

  • -L edits the file in place, like sed -i
  • -N declares the XML namespace prefix used in the XPath expression, if needed
  • -d deletes the nodes matching the XPath expression

Check xmlstarlet edit --help
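
For readers without xmlstarlet, here is a minimal sketch of the same deletion in Python, using only the standard library's xml.etree.ElementTree. This is a swapped-in alternative, not the command above. The namespace URI urn:example:w, the SAMPLE document and the helper name remove_elements are placeholders invented for illustration; substitute the URI bound to the w prefix in your own file.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (an assumption for this sketch); use the URI
# that your own document binds to the "w" prefix.
W_NS = "urn:example:w"

# Keep the "w" prefix when serializing instead of an auto-generated one.
ET.register_namespace("w", W_NS)

# Hypothetical, heavily shortened stand-in for the real file.
SAMPLE = f"""<w:numbering xmlns:w="{W_NS}">
  <w:lvl w:ilvl="0">
    <w:numFmt w:val="bullet"/>
    <w:rPr><w:rFonts w:ascii="Symbol" w:hAnsi="Symbol" w:hint="default"/></w:rPr>
  </w:lvl>
  <w:lvl w:ilvl="1">
    <w:rPr><w:sz w:val="22"/><w:szCs w:val="22"/></w:rPr>
  </w:lvl>
</w:numbering>"""


def remove_elements(parent, tag):
    """Recursively remove every descendant whose qualified tag equals `tag`.

    Returns the number of subtrees removed.
    """
    removed = 0
    for child in list(parent):  # iterate over a copy, since we mutate parent
        if child.tag == tag:
            parent.remove(child)  # drops the element and everything inside it
            removed += 1
        else:
            removed += remove_elements(child, tag)
    return removed


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    # ElementTree spells qualified names as "{namespace-uri}localname".
    count = remove_elements(root, f"{{{W_NS}}}rPr")
    print(f"removed {count} w:rPr element(s)")
    print(ET.tostring(root, encoding="unicode"))
```

To work on a file instead of a string, ET.parse(path) and tree.write(path) would take the place of fromstring and tostring. Note that ElementTree only keeps the prefixes you register, so a document with many namespaces needs one register_namespace call per prefix to round-trip cleanly.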


Please, never ever use sed for this task!

Every time you use sed for HTML or XML, you kill a kitty.

Theory:

According to compiler theory, XML/HTML cannot be parsed with regular expressions, which are based on finite state machines. Because XML/HTML is hierarchical, you need a pushdown automaton and an LALR grammar handled by a tool like YACC.
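
A small sketch of what this means in practice, in Python with the standard library only. The element names and the NESTED input are made up for illustration. A non-greedy regex stops at the first closing tag it sees, so a nested element leaves debris behind, while a parser removes the whole subtree.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: a <box> nested inside another <box>.
NESTED = "<doc><box>outer<box>inner</box>tail</box><keep/></doc>"


def strip_with_regex(xml_text):
    """Naive approach: delete from each <box> to the nearest </box>."""
    return re.sub(r"<box>.*?</box>", "", xml_text, flags=re.DOTALL)


def strip_with_parser(xml_text):
    """Tree approach: parse, drop every <box> subtree, serialize again."""
    root = ET.fromstring(xml_text)

    def prune(parent):
        for child in list(parent):  # copy, since we mutate while looping
            if child.tag == "box":
                parent.remove(child)
            else:
                prune(child)

    prune(root)
    return ET.tostring(root, encoding="unicode")


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    for name, fn in (("regex", strip_with_regex), ("parser", strip_with_parser)):
        result = fn(NESTED)
        print(f"{name:6} -> {result!r} (well-formed: {is_well_formed(result)})")
```

The regex version matches the outer opening tag against the inner closing tag and leaves a stray `tail</box>` in the output, which no longer parses. A greedy pattern fails in the opposite direction: it swallows everything between the first opening tag and the last closing tag, including unrelated siblings.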

realLife©®™ everyday tools in a shell:

You can use one of the following:

xmllint often installed by default with libxml2, xpath1

xmlstarlet can edit, select, transform... Not installed by default, xpath1

xpath installed via perl's module XML::XPath, xpath1

xidel xpath3

saxon-lint my own project, wrapper over @Michael Kay's Saxon-HE Java library, xpath3

Or you can use a high-level language with proper libraries. I am thinking of:

Python's lxml (from lxml import etree)

Perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath

Ruby's Nokogiri, check this example

PHP's DOMXPath, check this example

Check: Using regular expressions with HTML tags
