As the wget man page says:
to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to -p:
wget -E -H -k -K -p http://mysite.com/directory
I understand that if I want to download mysite entirely, I have to add the -r option. But using both -r and -H ends up downloading every website reachable from http://mysite.com/directory, not just mysite. Any idea?
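So the command I am effectively running is something like this (just the man page example with -r added):

wget -r -E -H -k -K -p http://mysite.com/directory

and with -H in there, the recursion follows links onto every other host as well.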
Best Answer
If you want to use wget, you can use its mirror setting to make an offline copy of a website, although some sites may block it with a robots.txt that stops automated spidering. I have always had a few problems with wget (see my other suggestion below), but the following command does work for many sites. Be aware, however, that the -H switch allows it to follow links to other sites and save those as well; that switch can obviously be removed if it is not required.

The --wait option puts a gap between wget's requests so that the site is not overwhelmed, and the -x switch specifies that the site's directory structure should be recreated exactly in a folder under the directory you run the command from (your home folder, say). The -m switch stands for mirror mode, which lets wget download recursively through the site, and the -k switch means that after the download the links in the saved pages point to the files in your local mirror rather than back at the site itself.
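A command along these lines combines the switches described above (a sketch only; the two-second wait and the URL are assumptions, not the original values):

wget --wait=2 -x -m -k -H http://mysite.com/directory

Drop -H if you do not want it wandering onto other hosts.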
After man wget, perhaps the best listing and detailed explanation of wget commands is here.
If wget is unsuccessful and you can't grab as much as you want, I would try the command-line program httrack, or its web interface, webhttrack, both of which are available in the repositories. This program has a large number of options, but it is better than wget at downloading whole websites or parts of them. Webhttrack gives you a wizard to follow for downloading a site (it opens in your browser), as the screenshot below shows.
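If you would rather stay on the command line, a minimal httrack invocation looks something like this (the output directory and the filter here are assumptions for illustration; webhttrack asks for the same settings through its wizard):

httrack "http://mysite.com/directory" -O ~/websites/mysite "+*.mysite.com/*"

The -O option sets where the mirror (plus logs and cache) is written, and the "+" filter keeps the crawl on mysite.com.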