I want wget to scan the subdirectories and sub-subdirectories of:
domain.com/profile/username/albums/
Then grab every .jpg file from their sources.
Wget should get files like:
domain.com/profile/username/albums/album1/43434
(…)
domain.com/profile/username/albums/album6/4343
And download every image from their sources (unfortunately, these images are hosted on a different server).
Is this possible?
I've been playing with -p, -A .jpg, and -r with depths of 1 through 5, but it grabs everything, like:
domain.com/profile/anotherusername/albums
domain.com/site/contactus
domain.com/site/anothersite
commercials-for-domain.com/banner/
etc.
wget -E -H -k -K -p domain.com/profile/username/albums/album1/43434
works perfectly, but only for a single page; I'm not sure how to "scan" for the different albums and files.
I need to do this because a friend of mine got her computer stolen and all her pictures are on this page and nowhere else. There are almost 200 of them with div overlays above them so it's hard to save them manually!
[edit]
The path tree looks exactly like this:
First level:
domain.com/profile/username/albums/
Second level:
domain.com/profile/username/albums/1,My Birthday Photos/
domain.com/profile/username/albums/2,Photos_From_2011-09-25/
Third level:
domain.com/profile/username/albums/1,My Birthday Photos/75893989,
domain.com/profile/username/albums/2,Photos_From_2011-09-25/74893213,
Best Answer
OK, all photos in the two albums have been retrieved.
As to how: it can be pieced together from the comments I made and from michail's remarks.
There are two albums at http://www.fotka.pl/profil/AlekSanDraa2601/albumy/: one has 100 photos, the other 63.
Here is the one with 100 of them: http://www.megaupload.com/?d=30RWXKN9. Here is the album with 63: http://www.megaupload.com/?d=CC27NM41.
Take the source code of the first album from here: http://www.fotka.pl/profil/AlekSanDraa2601/albumy/1,Ja/74892555
Extracting the image URLs: all the thumbnails end in _72_p.jpg. We don't want those; we want the larger versions, which require two changes in the URL: amin.fotka becomes a.fotka, and _72_p becomes _500_s.
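As an illustration (not the exact command from the answer), the two URL rewrites can be done with sed; the thumbnail URL and file names below are made up:

```shell
# Hypothetical example URL written to thumbs.txt; the real list came from the page source.
printf '%s\n' 'http://amin.fotka.pl/zdjecia/12345_72_p.jpg' > thumbs.txt
# Rewrite the host and the size suffix to get the full-size image URLs.
sed -e 's/amin\.fotka/a.fotka/' -e 's/_72_p/_500_s/' thumbs.txt > fullsize.txt
cat fullsize.txt
```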
The same applies to the second album; for example, the second album with 63 photos: http://www.fotka.pl/profil/AlekSanDraa2601/albumy/2,Fotki_z_2011-09-25/75893982,,1319485161
Here is blist3.txt, a list of all the JPGs in _72_p form: http://pastebin.com/raw.php?i=Y2nXfAXT
You can get that with a line like this:
Either edit the page source first to remove any miscellaneous parts (HTML attributes, obvious things that shouldn't be there),
or use a better line that just extracts all the URLs directly, without leaving anything miscellaneous to remove.
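The exact command isn't shown above, but one hedged sketch of such a line, assuming the album page has been saved as page.html, could be:

```shell
# Stand-in page.html; the real one is the saved album source.
printf '%s\n' '<img src="http://amin.fotka.pl/x/111_72_p.jpg"><img src="http://ads.example/logo.jpg">' > page.html
# Pull every .jpg URL out of the HTML, one per line, with nothing extra to clean up.
grep -oE 'http://[^"]+\.jpg' page.html > blist3.txt
cat blist3.txt
```

This list still contains unrelated JPGs (banners, logos), which is why the next step filters on _72_p.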
That gives you more URLs than you want: for the second album the command yields 97, and you only want the ones with _72_p in the URL.
So pipe through | grep -E "72_p" to get a list of just the photos you want.
There are 63 URLs in that file, which is the right number: all the photos in that album.
wget -i list.txt -w 3
http://www.megaupload.com/?d=CC27NM41
So that's all of them, all 163 (100 + 63), from the two albums.
This is the procedure one would use to turn a list of the JPGs into downloads:
listps2.txt is a file with all the JPGs, both relevant and irrelevant ones. The relevant ones are in _72_p form: extract them with grep, rewrite them with sed, put the results in "thatfile", and then run wget -i thatfile -w 3, as I did.
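Putting those steps together, a sketch of the whole recipe (file names as in the answer, input URL made up):

```shell
# Stand-in listps2.txt; the real one held every .jpg URL from the page source.
printf '%s\n' 'http://amin.fotka.pl/a/1_72_p.jpg' 'http://other.example/logo.jpg' > listps2.txt
# Keep only the _72_p thumbnails, rewrite them to full-size form, save as "thatfile".
grep '_72_p' listps2.txt \
  | sed -e 's/amin\.fotka/a.fotka/' -e 's/_72_p/_500_s/' > thatfile
cat thatfile
# Then download with a 3-second pause between files:
# wget -i thatfile -w 3
```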