Wget – save all data (images) from a given directory

wget

I want wget to scan the subdirectories and sub-subdirectories of:

domain.com/profile/username/albums/

Then grab every .jpg file from their sources.

Wget should get files like:

domain.com/profile/username/albums/album1/43434

(…)

domain.com/profile/username/albums/album6/4343

And download every image from its source (unfortunately, these images are hosted on a different server).

Is this possible?

I've been playing with -p, -A .jpg and -r with levels 1/2/3/4/5, but it grabs everything, like:

domain.com/profile/anotherusername/albums

domain.com/site/contactus

domain.com/site/anothersite

commercials-for-domain.com/banner/

etc.

wget -E -H -k -K -p domain.com/profile/username/albums/album1/43434

That works perfectly, but only for that one page; I'm not sure how to "scan" for the different albums and files.
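For reference, this is roughly the kind of invocation I have been aiming for (a sketch only; the second domain is a placeholder for whatever server actually hosts the images):

wget -r -l 3 -np -H -D domain.com,imagehost.example.com -A .jpg http://domain.com/profile/username/albums/

-np should keep it under /albums/, and -H with -D should allow the hop to the other server, but I suspect -A .jpg also stops wget from following the album pages themselves, since they have no file extension.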

I need to do this because a friend of mine got her computer stolen and all her pictures are on this page and nowhere else. There are almost 200 of them with div overlays above them so it's hard to save them manually!

[edit]

The path tree looks exactly like this:

First level:

domain.com/profile/username/albums/

Second level:

domain.com/profile/username/albums/1,My Birthday Photos/

domain.com/profile/username/albums/2,Photos_From_2011-09-25/

Third level:

domain.com/profile/username/albums/1,My Birthday Photos/75893989,

domain.com/profile/username/albums/2,Photos_From_2011-09-25/74893213,

Best Answer

OK. All the photos in the two albums have been retrieved.

As for how: it can be pieced together from the comments I made and from michail's remarks; here is the process.

There are two albums at http://www.fotka.pl/profil/AlekSanDraa2601/albumy/. One has 100 photos, the other 63.

Here is the one with 100 of them: http://www.megaupload.com/?d=30RWXKN9
Here is the album with 63 of them: http://www.megaupload.com/?d=CC27NM41

Take the page source from the first album, here: http://www.fotka.pl/profil/AlekSanDraa2601/albumy/1,Ja/74892555

Extracting the image URLs: all the thumbnails end in _72_p.jpg. We don't want those; we want the larger versions, which need two changes in the URL: amin.fotka becomes a.fotka, and _72_p becomes _500_s.
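To make the substitution concrete, here is the pattern applied to a made-up thumbnail URL (only the amin.fotka host and the _72_p / _500_s suffixes are real; the path in the middle is invented for illustration):

C:\>echo http://amin.fotka.pl/made/up/path/12345678_72_p.jpg | sed "s/_72_p/_500_s/" | sed "s/amin\.fotka/a.fotka/"
http://a.fotka.pl/made/up/path/12345678_500_s.jpg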

It is the same for the second album, the one with 63 photos, for example: http://www.fotka.pl/profil/AlekSanDraa2601/albumy/2,Fotki_z_2011-09-25/75893982,,1319485161

Here is blist3.txt, a list of all the JPGs in _72_p form: http://pastebin.com/raw.php?i=Y2nXfAXT

You can get that with a line like this:

C:\>type source.txt | grep -oE "http://.*?\.jpg"  >urls

Then edit the source to remove any miscellaneous parts, like HTML attributes and other obvious things that shouldn't be there.

Or use this line instead, which is better and should just get them all without anything miscellaneous to remove (grep -E has no non-greedy .*?, so the first pattern can swallow extra text; [^ ]* keeps each match inside a single whitespace-delimited token):

C:\>type source.txt | grep -oE "http://[^ ]*\.jpg"  >urls

You still have more URLs than you want there: for the second album that command gives 97, and you only want the ones with _72_p in the URL.
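A quick way to check that count (urls is the file created by the redirect above; 97 is the figure just mentioned):

C:\>type urls | wc -l
97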

So add | grep -E "72_p" to the pipeline and you get a list of just the photos you want.
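Put together in one pass, with the same file names as above, that is:

C:\>type source.txt | grep -oE "http://[^ ]*\.jpg" | grep "72_p" >list.txt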

C:\>type list.txt | wc -l
63

See, there are 63 lines in that file, the right number: that is all of them in that album, all 63. Then fetch them with the line below (-w 3 waits 3 seconds between downloads, to go easy on the server):

wget -i list.txt -w 3

(That is the set uploaded in the second archive above: http://www.megaupload.com/?d=CC27NM41)

So that's all of them, all 163 (100 + 63), from the two albums.

This is the line one would use to process the list of JPGs. listps2.txt is a file with all the JPGs, both relevant and irrelevant; the relevant ones are in _72_p form. Extract the relevant ones with grep, change them with sed, put them in "thatfile", and you can then do wget -i thatfile -w 3, as I did.

C:\>type listps2.txt | grep "72_p" | sed "s/_72_p/_500_s/" | sed "s/amin\.fotka/a.fotka/" >thatfile
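As a quick sanity check before downloading, the transformed list should still have the same number of lines (63 for the second album), since grep only filters and sed only rewrites each line in place:

C:\>type thatfile | wc -l
63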

C:\>wget -i thatfile