Bash – extract filenames from html file containing multiple links

bash, grep, html, regular expression, text processing

I have downloaded an HTML file that was autogenerated by a script on a webpage.
The file contains multiple links, including links to images.
I am trying to extract the full names of the images, for example:

<a href="000000.jpg" title="image name.jpg" target="_blank">Image name.jpg</a>

From the above I want to get "Image name.jpg" stored in a file. Since there are hundreds of these, I parse the file and store each name as it comes up with the following command:

grep -i -E -o "target=\"_blank\">([[:graph:]]*)\.(jpg|png|gif|webm)" "$thread" | cut -f 2 -d '>' | sed 's/ /_/g' - > "$names"

where "$thread" is the name of the html file, "$names" is the list of filenames as output. I use "cut" to remove the 'target="_blank">' portion, then convert the spaces to underscores.

Since there are several other links in the file, I specify the extensions to grab (images and webm). Everything else should be ignored. I got it to the point where it grabs only these links, but it still misses some.

Some file names contain spaces and non-alphanumeric characters. If I use [[:print:]], which should cover all these cases, I get nothing, or I get a bit of the <head> portion of the HTML and nothing else. If I use [[:graph:][:space:]], I also get nothing. If I just use [[:graph:]] (as above) or [[:alnum:][:punct:]], I get names with alphanumeric and other characters (like "filenamewith(parenthesis).jpg") but not names with spaces. The reverse is also true: [[:alnum:][:space:]] works for "file name with spaces.jpg" but omits names with other printable characters, such as "with(parenthesis,comma or other.jpg".

Supposedly [[:print:]] covers all cases, but I don't get what I need. If I'm understanding correctly, grep -E -o should only match (per the case above):
*.jpg, *.png, *.gif or *.webm

I have tried grep with and without -E/-o/-e in different variations.

Any ideas? I am using Arch Linux, grep 2.20, bash 4.3.18

Best Answer

The best strategy would be to use a proper HTML parser that can print the text content of every <a> element.

xmlstarlet is specifically an XML parser, and your HTML may not be well-formed XML, so it may not work on your file as-is, but this should give you the idea:

echo '<html>
<a href="000000.jpg" title="image name.jpg" target="_blank">Image name.jpg</a>
</html>' | xmlstarlet sel -t -v //a
Image name.jpg
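
If you would rather stay with grep, one possible approach is to swap the character class in your command for [^<]*, which matches everything up to the next tag. That allows spaces and punctuation in the name while stopping the match before any later markup. A class such as [[:print:]] also matches <, > and ", so a greedy match can run across several tags on the same line, which may be part of what you are seeing. This is only a sketch: the sample file below is made up for illustration, and it assumes the link text never contains a literal < or >.

```shell
# Hypothetical sample input; in the question, "$thread" is the downloaded HTML file.
thread=$(mktemp)
names=$(mktemp)
cat > "$thread" <<'EOF'
<a href="000000.jpg" title="image name.jpg" target="_blank">Image name.jpg</a>
<a href="000001.png" title="x" target="_blank">with(parenthesis,comma or other.png</a>
<a href="page.html" target="_blank">Not an image</a>
EOF

# [^<]* matches any run of characters other than "<", so spaces and
# punctuation are both accepted and the match cannot cross into the next tag.
grep -i -E -o 'target="_blank">[^<]*\.(jpg|png|gif|webm)' "$thread" |
  cut -d '>' -f 2 |   # drop the 'target="_blank">' prefix
  sed 's/ /_/g' > "$names"   # convert spaces to underscores

cat "$names"
```

With the sample input above, the third link is skipped because its text does not end in one of the listed extensions.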