I have downloaded an HTML file autogenerated by a script on a webpage.
The file contains multiple links, including links to images.
I am trying to extract the full names of the images, for example:
<a href="000000.jpg" title="image name.jpg" target="_blank">Image name.jpg</a>
From the above I want to get "Image name.jpg" stored in a file.
Since there are hundreds of these, I parse the file and store each name as it comes up with the following command:
grep -i -E -o "target=\"_blank\">([[:graph:]]*)\.(jpg|png|gif|webm)" "$thread" | cut -f 2 -d '>' | sed 's/ /_/g' - > "$names"
where "$thread" is the name of the html file and "$names" is the list of filenames as output. I use "cut" to remove the 'target="_blank">' portion, then convert the spaces to underscores.
Since there are several other links in the file, I specify the extensions to grab (images and webm). Everything else should be ignored. I got it to the point where it grabs only these links, but it still misses some.
Some file names contain spaces and non-alphanumeric characters. If I use [[:print:]], which should cover all these cases, I get nothing, or I get a bit of the <head> portion of the html and nothing else. If I use [[:graph:][:space:]], I also get nothing. If I just use [[:graph:]] (as above) or [[:alnum:][:punct:]], I can get files with alphanumeric/other characters (like "filenamewith(parenthesis).jpg"), but not spaces. The reverse is also true: [[:alnum:][:space:]] works but omits the other printable characters ("file name with spaces.jpg" works but not "with(parenthesis,comma or other.jpg").
Supposedly [[:print:]] covers all cases, but I don't get what I need. If I'm understanding correctly, grep -E -o should only match (per the case above) *.jpg, *.png, *.gif or *.webm.
I have tried grep with and without -E/-o/-e in different variations.
Any ideas? I am using Arch Linux, grep 2.20, bash 4.3.18
Best Answer
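A quick fix that stays within grep, offered only as a minimal sketch: [[:print:]] and [[:graph:][:space:]] also match spaces, '<' and '>', so a greedy * can run past the closing </a> to a later extension on the same line, which is one likely reason the output looks wrong. Limiting the match to characters other than '<' avoids that. The sketch assumes the link text itself never contains '<' and that each link sits on a single line; the sample file and the names in it are made up for illustration.

```shell
# Hypothetical sample standing in for "$thread"; the links are made up.
thread=$(mktemp)
names=$(mktemp)
cat > "$thread" <<'EOF'
<a href="000000.jpg" title="image name.jpg" target="_blank">Image name.jpg</a> <a href="000001.png" title="b" target="_blank">with(parenthesis,comma or other.png</a>
<a href="page.html" target="_blank">Not an image</a>
EOF

# [^<]* matches any run of characters except '<', so a match cannot
# cross the closing </a> tag, yet spaces and punctuation are allowed.
grep -i -E -o 'target="_blank">[^<]*\.(jpg|png|gif|webm)' "$thread" |
  cut -f 2 -d '>' |   # keep only the text after 'target="_blank">'
  sed 's/ /_/g' > "$names"   # spaces to underscores

cat "$names"
```

For the two made-up image links above, this should leave Image_name.jpg and with(parenthesis,comma_or_other.png in "$names", while the non-image link is skipped.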
The best strategy would be to use a proper html parser that can spit out the value of all <a> tags. Here, xmlstarlet is specifically an XML parser, and your HTML may not be well-formed XML, but you might get the idea: