As the wget man page says:
to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to -p:
wget -E -H -k -K -p http://mysite.com/directory
I understand that if I want to download mysite entirely, I have to add the -r option. But using both -r and -H ends up downloading every website reachable from http://mysite.com/directory, not just mysite. Any idea?
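So the command I am effectively running is something like this (just the man page example with -r added):

wget -r -E -H -k -K -p http://mysite.com/directory

and with -H in there, the recursion follows links onto every other host as well.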
Best Answer
If you want to use wget, you can use its mirror setting to make an offline copy of a website, although some sites may block it with a robots.txt that stops automated spidering. I have always had a few problems with wget (see my other suggestion below), but the following command does work for many sites. Be aware, however, that the -H switch allows it to follow links to other sites and save those as well; that switch can obviously be removed if it is not required.

The --wait option puts a gap between wget's requests so that the site is not overwhelmed, and the -x switch specifies that the site's directory structure should be recreated exactly in a folder under the directory you run the command from (your home folder, say). The -m switch stands for mirror mode, which lets wget download recursively through the site, and the -k switch means that after the download the links in the saved pages point to the files in your local mirror rather than back at the site itself.
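A command along these lines combines the switches described above (a sketch only; the two-second wait and the URL are assumptions, not the original values):

wget --wait=2 -x -m -k -H http://mysite.com/directory

Drop -H if you do not want it wandering onto other hosts.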
After man wget, perhaps the best listing and detailed explanation of wget commands is here.
If wget is unsuccessful and you can't grab as much as you want, I would try the command-line program httrack, or its web interface, webhttrack, both of which are available in the repositories. This program has a large number of options, but it is better than wget at downloading whole websites or parts of them. Webhttrack gives you a wizard to follow for downloading a site (it opens in your browser), as the screenshot below shows.
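If you would rather stay on the command line, a minimal httrack invocation looks something like this (the output directory and the filter here are assumptions for illustration; webhttrack asks for the same settings through its wizard):

httrack "http://mysite.com/directory" -O ~/websites/mysite "+*.mysite.com/*"

The -O option sets where the mirror (plus logs and cache) is written, and the "+" filter keeps the crawl on mysite.com.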