Bash – extract filenames from html file containing multiple links

bash, grep, html, regular expression, text processing

I have downloaded an HTML file autogenerated by a script on a webpage.
The file contains multiple links, including links to images.
I am trying to extract the full names of the images, for example:

<a href="000000.jpg" title="image name.jpg" target="_blank">Image name.jpg</a>

From the above, I want to get "Image name.jpg" stored in a file. Since there are hundreds of these, I parse the file and store each name as it comes up with the following command:

grep -i -E -o "target=\"_blank\">([[:graph:]]*)\.(jpg|png|gif|webm)" "$thread" | cut -f 2 -d '>' | sed 's/ /_/g' - > "$names"

where "$thread" is the name of the html file, "$names" is the list of filenames as output. I use "cut" to remove the 'target="_blank">' portion, then convert the spaces to underscores.

Since there are several other links in the file, I specify the extensions to grab (images and webm); everything else should be ignored. I got it to the point where it grabs only these links, but it misses some.

Some file names contain spaces and non-alphanumeric characters. If I use [[:print:]], which should cover all these cases, I get nothing, or I get a bit of the <head> portion of the HTML and nothing else. If I use [[:graph:][:space:]], I also get nothing. If I just use [[:graph:]] (as above) or [[:alnum:][:punct:]], I can get files with alphanumeric and other characters (like "filenamewith(parenthesis).jpg"), but not spaces. The reverse is also true: [[:alnum:][:space:]] works but omits the other printable characters ("file name with spaces.jpg" works but not "with(parenthesis,comma or other.jpg").

Supposedly [[:print:]] covers all cases, but I don't get what I need. If I'm understanding correctly, grep -E -o should only match (per the case above):
*.jpg *.png *.gif or *.webm

I have tried grep with and without -E/-o/-e in different variations.

Any ideas? I am using Arch Linux, grep 2.20, bash 4.3.18

Best Answer

The best strategy would be to use a proper html parser that can spit out the value of all <a> tags.

Here, xmlstarlet is specifically an XML parser, and your HTML may not be well-formed XML, but you might get the idea:

echo '<html>
<a href="000000.jpg" title="image name.jpg" target="_blank">Image name.jpg</a>
</html>' | xmlstarlet sel -t -v //a

Image name.jpg
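
If no parser is at hand, a regex fallback is possible. The sketch below is not part of the answer above: it swaps the POSIX character class for [^<]*, which accepts spaces and punctuation but cannot run past the closing </a>. It assumes the link text never contains a literal "<", and it uses a made-up sample file in place of the real "$thread".

```shell
#!/bin/sh
# Hypothetical sample input standing in for the downloaded page ("$thread").
thread=$(mktemp)
names=$(mktemp)
cat > "$thread" <<'EOF'
<a href="000000.jpg" title="image name.jpg" target="_blank">Image name.jpg</a><a href="1.png" target="_blank">with(parenthesis,comma or other.png</a>
<a href="page.html" target="_blank">Some page</a>
EOF

# [^<]* matches any characters except "<", so each match stays inside one link.
# The trailing "<" anchors the extension to the end of the link text.
grep -i -E -o 'target="_blank">[^<]*\.(jpg|png|gif|webm)<' "$thread" |
    sed -e 's/^[^>]*>//' -e 's/<$//' -e 's/ /_/g' > "$names"
# sed: drop everything up to the first ">", drop the trailing "<",
# then convert spaces to underscores as in the question.

cat "$names"
```

With the sample input, "$names" should end up holding the two image names with underscores in place of spaces, and the link to page.html is skipped.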

Related Solutions

Bash – For each subfolder, sort files by name and rename them to sequential padded numbers (regardless of extension)

for dir in */*; do           # loop over the directories
    (                        # run in a subshell ...
        cd "$dir" || exit    # ... so we don't have to cd back; skip if cd fails
        files=(*)            # store the filenames in a zero-indexed array

        for index in "${!files[@]}"; do
            file=${files[$index]}
            ext=${file##*.}
            newname=$(printf "%02d.%s" $((index+1)) "$ext")
            mv "$file" "$newname"
        done
    )
done

A file with no extension keeps its whole name, prefixed with the number (e.g. my_file => 05.my_file), because ${file##*.} returns the name unchanged when it contains no dot.

All non-hidden directory entries will be renamed, including directories.
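
To see how the new names are built before renaming anything, the name construction can be tried on its own. This is a small sketch; preview_name is a hypothetical helper, not part of the script above, and it only prints the name that would be used.

```shell
#!/bin/sh
# preview_name INDEX FILE: print the name the loop above would produce.
preview_name() {
    index=$1
    file=$2
    ext=${file##*.}                     # strip everything up to the last dot
    printf '%02d.%s\n' "$index" "$ext"  # zero-pad the index to two digits
}

preview_name 5 photo.jpg        # 05.jpg
preview_name 5 archive.tar.gz   # 05.gz
preview_name 5 my_file          # 05.my_file
```

Only the part after the last dot is kept, so archive.tar.gz becomes 05.gz rather than 05.tar.gz.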

Bash copy all files that don’t match the given extensions

You can use find to find all files in a directory tree that match (or don't match) some particular tests, and then to do something with them. For this particular problem, you could use:

find . -type f ! \( -iname '*.png' -o -iname '*.gif' -o -iname '*.jpg' -o -iname '*.xcf' \) -exec echo mv {} /new/path \;

This limits the search to regular files (-type f), and then to files whose names do not (!) have the extension *.png in any casing (-iname '*.png') or (-o) *.gif, and so on. All the extensions are grouped into a single condition between \( ... \). For each matching file it runs a command (-exec) that moves the file, the name of which is inserted in place of the {}, into the directory /new/path. The \; tells find that the command is over. As written, the echo in front of mv only prints each command instead of running it; remove it once the output looks right.

The name substitution happens inside the program-execution code, so spaces and other special characters don't matter.
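
The find approach can be tried safely on throwaway directories first. The following is a minimal sketch: src and dest are hypothetical temporary directories created only for the demonstration, and the echo has been dropped so the files are really moved. It assumes a find that supports -iname (GNU and BSD find both do).

```shell
#!/bin/sh
# Throwaway source tree and destination (hypothetical names for the demo).
src=$(mktemp -d)
dest=$(mktemp -d)
mkdir -p "$src/sub dir"
touch "$src/a.PNG" "$src/notes.txt" "$src/sub dir/b.jpg" "$src/sub dir/read me.md"

# Move every regular file whose name does NOT end in one of the extensions.
# "--" stops mv from treating a name that starts with "-" as an option.
find "$src" -type f ! \( -iname '*.png' -o -iname '*.gif' -o -iname '*.jpg' -o -iname '*.xcf' \) \
    -exec mv -- {} "$dest" \;

printf '%s\n' "$dest"/*    # list what was moved
```

Here notes.txt and "read me.md" should end up in "$dest", while a.PNG and b.jpg stay where they were. The destination sits outside the searched tree so that find does not come across the files it has just moved.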

If you want to do this just inside Bash, you can use Bash's extended pattern matching features. These require that shopt extglob is on, and globstar too. In this case, use:

mv **/!(*.[gG][iI][fF]|*.[pP][nN][gG]|*.[xX][cC][fF]|*.[jJ][pP][gG]) /new/path

This matches all files in subdirectories (**) that do not match *.gif, *.png, etc, in any combination of character cases, and moves them into the new path. The expansion is performed by the shell, so spaces and special characters don't matter again.

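With globstar on, **/ also matches zero directories in Bash, so files in the current directory are included as well. The pattern matches directories too, so a directory whose name lacks these extensions is passed to mv along with the files inside it.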
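
A cautious way to try the pure-Bash variant is sketched below. It needs Bash 4 or later for globstar; work and dest are hypothetical temporary directories used only for the demonstration. The loop form is an assumption of this sketch rather than the command above: it makes it possible to skip directories, which the negated pattern would otherwise match as well.

```shell
#!/usr/bin/env bash
# Both options must be enabled before the line containing the pattern is parsed.
shopt -s extglob globstar

work=$(mktemp -d)
dest=$(mktemp -d)
mkdir -p "$work/sub dir"
touch "$work/a.PNG" "$work/sub dir/b.gif" "$work/sub dir/read me.md"
cd "$work" || exit 1

for f in **/!(*.[gG][iI][fF]|*.[pP][nN][gG]|*.[xX][cC][fF]|*.[jJ][pP][gG]); do
    [[ -f $f ]] || continue    # skip directories and an unmatched pattern
    mv -- "$f" "$dest"/
done

printf '%s\n' "$dest"/*    # list what was moved
```

In this sample tree only "read me.md" should be moved; a.PNG and b.gif are excluded by the pattern, and the directory "sub dir" is skipped by the -f test.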

There are similar features in zsh and other shells, but you've indicated you're using Bash.

(A further note: parsing ls is never a good idea - just don't try it.)
