Using Bash to split XML data into variables

curl scripting variable xml xmllint

I am trying to download some files from a service. The files are listed in an XML file, which can reference a single file or several files. However, I have a problem with my script: I do not know how to split the string returned by xmllint into an array so that I can download each file individually.

I need to split the string into several variables and then download the file behind each URL.

The 201701_1 files do not repeat, so I can download them with curl without any problems. The coverage.zip files, however, all share the same name, so curl overwrites them.
I then use curl to download the individual files:

curl -O -b cookie $URL 

At the moment, my script is as follows:

while read edition; do
    XML="<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<download-area>
  <files>
    <file>
      <url>https://google.com/411/201701_01_01.zip</url>
    </file>
    <file>
      <url>https://google.com/411/201701_01_02.zip</url>
    </file>
  </files>
</download-area>
"
    URL=$(echo $XML | xmllint --xpath \
    "/*[name()='download-area']/*[name()='files']/*[name()='file']/*[name()='url']/text()" -)

    echo "URL:: " $URL

done < $LATEST_EDITION

LATEST_EDITION is simply a file with lines.

My questions are:
How can I split VAR_1 and VAR_2 into several URLs so that I can download them individually?
How can I prevent coverage.zip from being overwritten?

Best Answer

xmllint is pretty useless for extracting information from XML documents. You may want to consider xmlstarlet, xml_grep (from Perl's XML::Twig) or xml2 instead.

With xmllint, you could still extract one string at a time with:

VAR1=$(printf '%s\n' "$XML" |
  xmllint --xpath '/download-area/files/file[1]/url/text()' -)
VAR2=$(printf '%s\n' "$XML" |
  xmllint --xpath '/download-area/files/file[2]/url/text()' -)
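If only xmllint is available, the per-index queries above can be put in a loop instead of being written out one by one. The following is a minimal sketch, assuming an xmllint build that supports --xpath; it uses the XPath count() function to find out how many file elements exist. The sample document and the array name urls are only illustrative.

```shell
#!/bin/bash
# Sample document with the same structure as in the question.
XML='<?xml version="1.0" encoding="UTF-8"?>
<download-area>
  <files>
    <file><url>https://google.com/411/201701_01_01.zip</url></file>
    <file><url>https://google.com/411/201701_01_02.zip</url></file>
  </files>
</download-area>'

# Ask xmllint how many <file> elements the document contains.
n=$(printf '%s\n' "$XML" |
  xmllint --xpath 'count(/download-area/files/file)' -)

# Query one <url> per iteration so the values never run together.
urls=()
for ((i = 1; i <= n; i++)); do
  urls+=("$(printf '%s\n' "$XML" |
    xmllint --xpath "/download-area/files/file[$i]/url/text()" -)")
done

# Print one URL per line.
printf '%s\n' "${urls[@]}"
```

This runs xmllint once per file entry, so it is likely to be slower than the single-pass alternatives for large documents.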

For values that, as here, contain no newline characters, you can use bash's readarray:

readarray -t var < <(
  xmlstarlet sel -t -v /download-area/files/file/url  <<< "$XML")

Or:

readarray -t var < <(
  xml2 <<< "$XML" | sed -n 's|^/download-area/files/file/url=||p')

Or:

readarray -t var < <(
  xml_grep --text_only /download-area/files/file/url <<< "$XML")
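
The question also asks how to keep identically named files such as coverage.zip from overwriting each other. One option, sketched here on the assumption that the URLs are already in a bash array, is to give curl an explicit output name with -o instead of -O. The function names output_name and download_all and the index-prefix naming scheme are hypothetical choices made for this illustration.

```shell
#!/bin/bash
# output_name INDEX URL: build a local file name that is unique per array
# position, e.g. index 2 and .../coverage.zip gives 2_coverage.zip.
output_name() {
  local index=$1 url=$2
  local base=${url##*/}   # strip everything up to the last slash
  base=${base%%\?*}       # drop a query string, if any
  printf '%s_%s\n' "$index" "${base:-download}"
}

# download_all URL...: fetch each URL with the cookie file from the question,
# writing to the name chosen by output_name instead of the remote name.
download_all() {
  local i=0 url
  for url in "$@"; do
    i=$((i + 1))
    curl -b cookie -o "$(output_name "$i" "$url")" "$url"
  done
}
```

With the array filled by readarray, calling download_all "${var[@]}" would then save each file under its own name. Prefixing the position in the list is only one way to make names unique; part of the URL path could be used instead.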