Linux – Using Wget to Recursively Crawl a Site and Download Images

bash, linux, script, web-crawler, wget

How do you instruct wget to recursively crawl a website and only download certain types of images?

I tried using this to crawl a site and only download JPEG images:

wget --no-parent --wait=10 --limit-rate=100K --recursive --accept=jpg,jpeg --no-directories http://somedomain/images/page1.html

However, even though page1.html contains hundreds of links to subpages, which themselves have direct links to images, wget reports things like "Removing subpage13.html since it should be rejected", and never downloads any images, since none are directly linked to from the starting page.

I'm assuming this is because my --accept option is being used both to direct the crawl and to filter what gets downloaded, whereas I want it to control only which files are downloaded. How can I make wget crawl all links, but only download files with certain extensions such as *.jpeg?
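One commonly suggested workaround, sketched here on the assumption that the subpages really do end in .html: also accept the page types so wget keeps and follows them, then delete the leftover HTML locally once the crawl finishes.

# accept the HTML pages as well so the crawl can follow their links
wget --recursive --no-parent --no-directories \
     --wait=10 --limit-rate=100K \
     --accept=jpg,jpeg,html,htm \
     http://somedomain/images/page1.html
# the HTML pages were only needed for link extraction, so drop them afterwards
find . -maxdepth 1 -name '*.htm*' -delete

Because --no-directories puts everything in the current directory, the cleanup is a single find over that directory.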

EDIT: Also, some pages are dynamic, and are generated via a CGI script (e.g. img.cgi?fo9s0f989wefw90e). Even if I add cgi to my accept list (e.g. --accept=jpg,jpeg,html,cgi) these still always get rejected. Is there a way around this?
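If the installed wget is new enough to have --accept-regex (1.14 or later), that option matches against the complete URL, query string included, rather than the filename suffix, so it may be worth testing against the CGI-generated URLs. The pattern below is only an illustration: it accepts the dynamic image URLs, plain JPEGs, and the HTML pages the crawl needs to follow.

# --accept-regex is matched against the full URL, including ?query strings
wget --recursive --no-parent --no-directories \
     --wait=10 --limit-rate=100K \
     --accept-regex 'img\.cgi\?|\.jpe?g$|\.html?$' \
     http://somedomain/images/page1.html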

Best Answer

Why don't you try using wget -A jpg,jpeg -r http://example.com?
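For reference, a sketch combining that suggestion with the throttling options from the original command (example.com is the answer's placeholder host):

# same -A/-r idea, keeping the original wait and rate limits
wget -r -A jpg,jpeg \
     --no-parent --no-directories \
     --wait=10 --limit-rate=100K \
     http://example.com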
