Wget – How to Create a Local Copy of a Website Section on OSX

mirror recursive wget

This question follows from: How do I create a local copy of a complete website section from OSX using curl?

After discovering that OSX's native curl couldn't do this task, I downloaded wget from here: http://www.techtach.org/wget-prebuilt-binary-for-mac-osx-lion

But performing:

./wget -r -l 0 https://ccrma.stanford.edu/~jos/mdft/

takes hours and downloads a ton of other stuff I didn't want that ISN'T contained in this folder:

http://cl.ly/ENKr

Moreover, when I open a particular page, many of the images are missing:

http://cl.ly/ELXG

This may be because I aborted the transfer after a few hours(!)

How do I do this properly?

Best Answer

Try adding:

--no-parent

From the wget manual: "Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded."

In my experience it also prevents downloading from other sites.
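
As a rough sketch, the pieces might fit together like this for the URL in the question. The function name mirror_section and the WGET variable are hypothetical (WGET only lets you point at a binary such as ./wget). The -p, -k and -N options are additions that the answer above does not mention, and whether -p recovers the missing images on this particular site is untested.

```shell
#!/bin/sh
# Hypothetical helper: mirror one section of a site without climbing
# into parent directories. Set WGET=./wget to use a binary in the
# current directory; otherwise the wget found on PATH is used.
mirror_section() {
    # -r -l 0     : recurse with no depth limit (as in the question)
    # --no-parent : never ascend above the starting directory
    # -p          : also request images/CSS needed to display each page
    # -k          : convert links so the local copy can be browsed offline
    # -N          : skip files that are no newer than the local copy
    "${WGET:-wget}" -r -l 0 --no-parent -p -k -N "$1"
}

# Example call (commented out because it starts a real download):
# WGET=./wget mirror_section "https://ccrma.stanford.edu/~jos/mdft/"
```

Because of -N, re-running the same call after an aborted transfer should skip files that are already up to date locally rather than fetching them again.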

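Make wget refer to a local copy without redundantly downloading files

Example

With -N (timestamping), wget skips any file that is no newer on the server than the local copy. First, download where the file robots.txt doesn't already exist locally: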

$ wget -N http://google.com/robots.txt
--2014-06-15 21:18:16--  http://google.com/robots.txt
Resolving google.com (google.com)... 173.194.41.9, 173.194.41.14, 173.194.41.0, ...
Connecting to google.com (google.com)|173.194.41.9|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.google.com/robots.txt [following]
--2014-06-15 21:18:17--  http://www.google.com/robots.txt
Resolving www.google.com (www.google.com)... 173.194.46.83, 173.194.46.84, 173.194.46.80, ...
Connecting to www.google.com (www.google.com)|173.194.46.83|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘robots.txt’

    [ <=>                                                                                                                                 ] 7,608       --.-K/s   in 0s      

2014-06-15 21:18:17 (359 MB/s) - ‘robots.txt’ saved [7608]

Trying it a second time, with robots.txt now present locally:

$ wget -N http://google.com/robots.txt
--2014-06-15 21:18:19--  http://google.com/robots.txt
Resolving google.com (google.com)... 173.194.41.8, 173.194.41.9, 173.194.41.14, ...
Connecting to google.com (google.com)|173.194.41.8|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.google.com/robots.txt [following]
--2014-06-15 21:18:19--  http://www.google.com/robots.txt
Resolving www.google.com (www.google.com)... 173.194.46.82, 173.194.46.83, 173.194.46.84, ...
Connecting to www.google.com (www.google.com)|173.194.46.82|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Server file no newer than local file ‘robots.txt’ -- not retrieving.

Notice that the second time, wget did not retrieve the file again.

How to download all files linked on a website using wget

wget's -A option takes a comma-separated accept LIST, not just a single item.

wget --no-directories --content-disposition --restrict-file-names=nocontrol \
    -e robots=off -A.pdf,.ppt,.doc -r url

See man wget and search for -A for more details.
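
A pared-down sketch of the same idea, with the accept list passed as a parameter. The function name fetch_by_suffix and the WGET variable are hypothetical, and the extra options from the command above (--content-disposition, --restrict-file-names, -e robots=off) are left out here only to keep the example short.

```shell
#!/bin/sh
# Hypothetical helper: recursively fetch only files whose names match
# a comma-separated accept list.
#   $1 : start URL
#   $2 : accept list, e.g. ".pdf,.ppt,.doc"
fetch_by_suffix() {
    # -r               : follow links recursively
    # --no-directories : save every file into the current directory
    # -A LIST          : keep only files matching the accept list
    "${WGET:-wget}" -r --no-directories -A "$2" "$1"
}

# Example call (commented out because it starts a real download;
# replace url with the page you want to crawl):
# fetch_by_suffix url ".pdf,.ppt,.doc"
```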
