Essentially, I want to crawl an entire site with Wget, but I need it to NEVER download other assets (images, CSS, JS, etc.). I only want the HTML files.
Google searches are completely useless.
Here's a command I've tried:
wget --limit-rate=200k --no-clobber --convert-links --random-wait -r -E -e robots=off -U "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36" -A html --domains=www.example.com http://www.example.com
Our site is a hybrid of flat PHP and a CMS, so HTML "files" could be /path/to/page, /path/to/page/, /path/to/page.php, or /path/to/page.html.
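For what it's worth, a URL-based filter would have to cover all four shapes. Assuming GNU Wget 1.14 or newer (which added --accept-regex), a sketch might look like this; the regex is illustrative, and extensionless asset URLs would still slip through, since it filters by URL shape, not by content type:

# Sketch only: --accept-regex is matched against the complete URL, so
# recursion only follows bare paths, trailing slashes, .php, and .html.
wget -r -e robots=off \
     --accept-regex '(/[^/.]+/?|\.php|\.html)$' \
     http://www.example.com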
I've even included -R js,css, but it still downloads the files, THEN rejects them (pointless waste of bandwidth, CPU, and server load!).
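The inverse filter also exists: --reject-regex (again Wget 1.14+) is checked against the URL before the request is made, so rejected assets are never fetched at all. A hedged sketch, with an extension list you'd tune to the site:

# Sketch only: reject common asset extensions by URL, before download,
# including any ?query suffix.
wget -r -e robots=off \
     --reject-regex '\.(js|css|png|jpe?g|gif|svg|ico|woff2?)([?].*)?$' \
     http://www.example.com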
Best Answer
@ernie's comment about --ignore-tags led me down the right path! When I looked up --ignore-tags in man wget, I noticed --follow-tags.
Setting --follow-tags=a allowed me to skip img, link, script, etc.
It's probably too limited for some people looking for the same answer, but it actually works well in my case (it's okay if I miss a couple of pages).
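For reference, here's my original command with --follow-tags=a added:

wget --limit-rate=200k --no-clobber --convert-links --random-wait -r -E \
     -e robots=off --follow-tags=a \
     -U "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36" \
     -A html --domains=www.example.com http://www.example.com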
If anyone finds a way to allow scanning ALL tags, but prevents wget from rejecting files only after they're downloaded (it should reject based on filename or the Content-Type header before downloading), I will very happily accept their answer!
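Until then, the closest thing I can imagine is a two-pass hack: spider the site to collect URLs, HEAD each one with curl, and fetch only what the server labels text/html. A rough, untested sketch; it saves transfer on assets, though the spider pass still makes requests:

# Pass 1: spider the site and scrape the URLs wget reports.
wget -r --spider -e robots=off http://www.example.com 2>&1 \
  | grep -oE 'https?://[^[:space:]]+' | sort -u > urls.txt

# Pass 2: download only URLs whose Content-Type header says HTML.
while read -r url; do
  ctype=$(curl -sI -o /dev/null -w '%{content_type}' "$url")
  case $ctype in
    text/html*) wget --no-clobber --force-directories "$url" ;;
  esac
done < urls.txt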