Problem with recursive download using wget

Tags: download, internet, wget

I am trying to learn how to use wget's recursive download feature, following the wget info page.

For example, let us try to download all the images from xkcd. A list of all the pages can be found in the xkcd archive. Each page contains a single PNG file, and the PNG files are hosted on a different host, imgs.xkcd.com.

I tried with this command:

wget -r -HD imgs.xkcd.com -l 2 -A.png http://www.xkcd.com/archive/ --random-wait
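
For reference, here is what each flag does (the behaviour is identical to the command above; the long-option names are shown only for readability, and -HD is simply -H combined with -D):

# -r   / --recursive       follow links recursively
# -H   / --span-hosts      allow recursion to leave the starting host...
# -D   / --domains         ...but only into the listed domain(s)
# -l 2 / --level 2         limit the recursion depth to 2
# -A.png / --accept .png   keep only files whose names match .png
# --random-wait            vary the delay between requests (a multiple of --wait)
wget -r -H -D imgs.xkcd.com -l 2 -A.png --random-wait http://www.xkcd.com/archive/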

The result:

 xkcd $ tree
.

0 directories, 0 files

 xkcd $ wget -r -HD imgs.xkcd.com -l 2 -A.png http://www.xkcd.com/archive/ --random-wait
--2014-01-10 18:49:55--  http://www.xkcd.com/archive/
Resolving www.xkcd.com (www.xkcd.com)... 107.6.106.82
Connecting to www.xkcd.com (www.xkcd.com)|107.6.106.82|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 83226 (81K) [text/html]
Saving to: `www.xkcd.com/archive/index.html'

100%[=============================================================================================================>] 83,226      68.3K/s   in 1.2s    

2014-01-10 18:49:57 (68.3 KB/s) - `www.xkcd.com/archive/index.html' saved [83226/83226]

Loading robots.txt; please ignore errors.
--2014-01-10 18:49:57--  http://imgs.xkcd.com/robots.txt
Resolving imgs.xkcd.com (imgs.xkcd.com)... 107.6.106.82
Reusing existing connection to www.xkcd.com:80.
HTTP request sent, awaiting response... 404 Not Found
2014-01-10 18:49:58 ERROR 404: Not Found.

Removing www.xkcd.com/archive/index.html since it should be rejected.

--2014-01-10 18:49:58--  http://imgs.xkcd.com/static/terrible_small_logo.png
Reusing existing connection to www.xkcd.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 11001 (11K) [image/png]
Saving to: `imgs.xkcd.com/static/terrible_small_logo.png'

100%[=============================================================================================================>] 11,001      --.-K/s   in 0.05s   

2014-01-10 18:49:58 (229 KB/s) - `imgs.xkcd.com/static/terrible_small_logo.png' saved [11001/11001]

FINISHED --2014-01-10 18:49:58--
Total wall clock time: 2.9s
Downloaded: 2 files, 92K in 1.2s (74.4 KB/s)

 xkcd $ tree
.
|-- imgs.xkcd.com
|   `-- static
|       `-- terrible_small_logo.png
`-- www.xkcd.com
    `-- archive

4 directories, 1 file

 xkcd $

This is obviously not what I want. It seems that wget rejected www.xkcd.com/archive/index.html before reading it and following its links. Even when .html is added to the accept list (as suggested in an answer), it does not download the images. What is the mistake in the command?
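
For the record, the accept-list variant mentioned above would look something like this (a sketch of that suggestion, not a fix; it still fails, for the reason given in the answer below):

wget -r -HD imgs.xkcd.com -l 2 -A .png,.html --random-wait http://www.xkcd.com/archive/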

Best Answer

The problem is your restriction on which links to follow: you have told wget to follow links only to imgs.xkcd.com. But the /archive/ page doesn't link there directly; it only links to other pages on www.xkcd.com, and it is those pages that contain the links to imgs.xkcd.com.

So you will need to allow that domain, too. This command works:

wget -r -HD imgs.xkcd.com,www.xkcd.com -l 2 -A.png http://www.xkcd.com/archive/ --random-wait
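
If you also want the images collected into a single flat directory, and an actual random pause between requests (--random-wait only varies the interval set with --wait, which defaults to 0), a variant along these lines should work. --no-directories, --directory-prefix and --wait are standard wget options; the xkcd-imgs directory name is just an example:

# flatten the host/path hierarchy and save everything into ./xkcd-imgs,
# waiting roughly a second between requests
wget -r -H -D imgs.xkcd.com,www.xkcd.com -l 2 -A.png \
     --no-directories --directory-prefix=xkcd-imgs \
     --wait=1 --random-wait \
     http://www.xkcd.com/archive/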