Make wget refer to a local copy without redundantly downloading files

wget

I want to archive a message board. I do that with wget, using the parameters --page-requisites, --span-hosts, --convert-links and --no-clobber.

The problem is that using --convert-links disables --no-clobber: for every thread page, wget re-downloads the site's skins, scripts and icons (in order to keep them up to date).

Is there a way to prevent wget from downloading files that already exist locally, so that links are rewritten to point at the local copies and only files missing from the filesystem are downloaded?
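For reference, the invocation described above looks roughly like this. The forum URL is a placeholder, and --recursive is an assumption on my part, since archiving a whole board implies a recursive crawl:

```shell
# Hypothetical sketch of the archiving command described in the question.
# Because --convert-links is present, wget ignores --no-clobber (recent
# versions print a warning to that effect), which is the problem at hand.
wget --recursive \
     --page-requisites \
     --span-hosts \
     --convert-links \
     --no-clobber \
     http://forum.example.com/
```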

Best Answer

I believe that if you include the switch -N, it will force wget to make use of timestamps.

   -N
   --timestamping
       Turn on time-stamping.

With this switch, wget will only download files that it does not already have locally, or whose copy on the server is newer than the local one.

Example

First download, where the file robots.txt doesn't already exist locally:

$ wget -N http://google.com/robots.txt
--2014-06-15 21:18:16--  http://google.com/robots.txt
Resolving google.com (google.com)... 173.194.41.9, 173.194.41.14, 173.194.41.0, ...
Connecting to google.com (google.com)|173.194.41.9|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.google.com/robots.txt [following]
--2014-06-15 21:18:17--  http://www.google.com/robots.txt
Resolving www.google.com (www.google.com)... 173.194.46.83, 173.194.46.84, 173.194.46.80, ...
Connecting to www.google.com (www.google.com)|173.194.46.83|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘robots.txt’

    [ <=>                                                                                                                                 ] 7,608       --.-K/s   in 0s      

2014-06-15 21:18:17 (359 MB/s) - ‘robots.txt’ saved [7608]

Trying it a second time, with the file robots.txt already present locally:

$ wget -N http://google.com/robots.txt
--2014-06-15 21:18:19--  http://google.com/robots.txt
Resolving google.com (google.com)... 173.194.41.8, 173.194.41.9, 173.194.41.14, ...
Connecting to google.com (google.com)|173.194.41.8|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.google.com/robots.txt [following]
--2014-06-15 21:18:19--  http://www.google.com/robots.txt
Resolving www.google.com (www.google.com)... 173.194.46.82, 173.194.46.83, 173.194.46.84, ...
Connecting to www.google.com (www.google.com)|173.194.46.82|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Server file no newer than local file ‘robots.txt’ -- not retrieving.

Notice that the second time, wget did not retrieve the file again.
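Applied to the archiving command from the question, -N replaces --no-clobber rather than being added alongside it: as far as I know, wget refuses to combine the two ("Can't timestamp and not clobber old files at the same time"). The URL and --recursive are placeholders/assumptions as before:

```shell
# Hypothetical adaptation of the question's command: -N replaces
# --no-clobber, since wget does not allow timestamping and no-clobber
# together. Unchanged page requisites should now be skipped on re-runs.
wget --recursive \
     --page-requisites \
     --span-hosts \
     --convert-links \
     -N \
     http://forum.example.com/
```

One caveat to be aware of: timestamp comparison relies on the server sending a Last-Modified header, and --convert-links rewrites the saved pages after download, so the HTML pages themselves may still be re-fetched on later runs even when the skins, scripts and icons are not.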
