Wget – save all data (images) from a given directory

wget

I want wget to scan the subdirectories and sub-subdirectories of:

domain.com/profile/username/albums/

Then grab every .jpg file from their sources.

Wget should get files like:

domain.com/profile/username/albums/album1/43434

(…)

domain.com/profile/username/albums/album6/4343

And download every image from its source (unfortunately, these images are hosted on a different server).

Is this possible?

I've been playing with -p, -A .jpg and -r with levels 1/2/3/4/5, but it grabs everything, like:

domain.com/profile/anotherusername/albums

domain.com/site/contactus

domain.com/site/anothersite

commercials-for-domain.com/banner/

etc.

wget -E -H -k -K -p domain.com/profile/username/albums/album1/43434

That works perfectly, but only for that one page; I'm not sure how to "scan" for the different albums and files.
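For reference, this is roughly the kind of invocation I have been aiming for (a sketch only; the second domain is a placeholder for whatever server actually hosts the images):

wget -r -l 3 -np -H -D domain.com,imagehost.example.com -A .jpg http://domain.com/profile/username/albums/

-np should keep it under /albums/, and -H with -D should allow the hop to the other server, but I suspect -A .jpg also stops wget from following the album pages themselves, since they have no file extension.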

I need to do this because a friend of mine got her computer stolen and all her pictures are on this page and nowhere else. There are almost 200 of them with div overlays above them so it's hard to save them manually!

[edit]

The path tree looks exactly like this:

First level:

domain.com/profile/username/albums/

Second level:

domain.com/profile/username/albums/1,My Birthday Photos/

domain.com/profile/username/albums/2,Photos_From_2011-09-25/

Third level:

domain.com/profile/username/albums/1,My Birthday Photos/75893989,

domain.com/profile/username/albums/2,Photos_From_2011-09-25/74893213,

Best Answer

OK. All the photos in the two albums have been retrieved.

As for how: it can be pieced together from the comments I made and from michail's remarks; here is the process.

There are two albums at http://www.fotka.pl/profil/AlekSanDraa2601/albumy/. One has 100 photos, the other 63.

Here is the one with 100 of them: http://www.megaupload.com/?d=30RWXKN9
Here is the album with 63 of them: http://www.megaupload.com/?d=CC27NM41

Take the page source from the first album, here: http://www.fotka.pl/profil/AlekSanDraa2601/albumy/1,Ja/74892555

Extracting the image URLs: all the thumbnails end in _72_p.jpg. We don't want those; we want the larger versions, which need two changes in the URL: amin.fotka becomes a.fotka, and _72_p becomes _500_s.
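To make the substitution concrete, here is the pattern applied to a made-up thumbnail URL (only the amin.fotka host and the _72_p / _500_s suffixes are real; the path in the middle is invented for illustration):

C:\>echo http://amin.fotka.pl/made/up/path/12345678_72_p.jpg | sed "s/_72_p/_500_s/" | sed "s/amin\.fotka/a.fotka/"
http://a.fotka.pl/made/up/path/12345678_500_s.jpg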

It is the same for the second album, the one with 63 photos, for example: http://www.fotka.pl/profil/AlekSanDraa2601/albumy/2,Fotki_z_2011-09-25/75893982,,1319485161

Here is blist3.txt, a list of all the JPGs in _72_p form: http://pastebin.com/raw.php?i=Y2nXfAXT

You can get that with a line like this:

C:\>type source.txt | grep -oE "http://.*?\.jpg"  >urls

Then edit the source to remove any miscellaneous parts, like HTML attributes and other obvious things that shouldn't be there.

Or use this line instead, which is better and should just get them all without anything miscellaneous to remove (grep -E has no non-greedy .*?, so the first pattern can swallow extra text; [^ ]* keeps each match inside a single whitespace-delimited token):

C:\>type source.txt | grep -oE "http://[^ ]*\.jpg"  >urls

You still have more URLs than you want there: for the second album that command gives 97, and you only want the ones with _72_p in the URL.
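A quick way to check that count (urls is the file created by the redirect above; 97 is the figure just mentioned):

C:\>type urls | wc -l
97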

So add | grep -E "72_p" to the pipeline and you get a list of just the photos you want.
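Put together in one pass, with the same file names as above, that is:

C:\>type source.txt | grep -oE "http://[^ ]*\.jpg" | grep "72_p" >list.txt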

C:\>type list.txt | wc -l
63

See, there are 63 lines in that file, the right number: that is all of them in that album, all 63. Then fetch them with the line below (-w 3 waits 3 seconds between downloads, to go easy on the server):

wget -i list.txt -w 3

(That is the set uploaded in the second archive above: http://www.megaupload.com/?d=CC27NM41)

So that's all of them, all 163 (100 + 63), from the two albums.

This is the line one would use to process the list of JPGs. listps2.txt is a file with all the JPGs, both relevant and irrelevant; the relevant ones are in _72_p form. Extract the relevant ones with grep, change them with sed, put them in "thatfile", and you can then do wget -i thatfile -w 3, as I did.

C:\>type listps2.txt | grep "72_p" | sed "s/_72_p/_500_s/" | sed "s/amin\.fotka/a.fotka/" >thatfile
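As a quick sanity check before downloading, the transformed list should still have the same number of lines (63 for the second album), since grep only filters and sed only rewrites each line in place:

C:\>type thatfile | wc -l
63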

C:\>wget -i thatfile