Make wget refer to a local copy without redundantly downloading files

wget

I want to archive a message board. I do that with wget, using the parameters --page-requisites, --span-hosts, --convert-links and --no-clobber.

The problem is that using --convert-links disables --no-clobber: for every thread page, wget re-downloads the site's skins, scripts and icons (in order to keep them up to date).

Is there a way to prevent wget from downloading files that already exist locally, so that links are rewritten to point at the local copies and only files missing from the filesystem are downloaded?
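For reference, the invocation described above looks roughly like this. The forum URL is a placeholder, and --recursive is an assumption on my part, since archiving a whole board implies a recursive crawl:

```shell
# Hypothetical sketch of the archiving command described in the question.
# Because --convert-links is present, wget ignores --no-clobber (recent
# versions print a warning to that effect), which is the problem at hand.
wget --recursive \
     --page-requisites \
     --span-hosts \
     --convert-links \
     --no-clobber \
     http://forum.example.com/
```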

Best Answer

I believe that if you include the switch -N, it will force wget to make use of timestamps.

   -N
   --timestamping
       Turn on time-stamping.

With this switch, wget will only download files that it does not already have locally, or whose copy on the server is newer than the local one.

Example

First download, where the file robots.txt doesn't already exist locally:

$ wget -N http://google.com/robots.txt
--2014-06-15 21:18:16--  http://google.com/robots.txt
Resolving google.com (google.com)... 173.194.41.9, 173.194.41.14, 173.194.41.0, ...
Connecting to google.com (google.com)|173.194.41.9|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.google.com/robots.txt [following]
--2014-06-15 21:18:17--  http://www.google.com/robots.txt
Resolving www.google.com (www.google.com)... 173.194.46.83, 173.194.46.84, 173.194.46.80, ...
Connecting to www.google.com (www.google.com)|173.194.46.83|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘robots.txt’

    [ <=>                                                                                                                                 ] 7,608       --.-K/s   in 0s      

2014-06-15 21:18:17 (359 MB/s) - ‘robots.txt’ saved [7608]

Trying it a second time, with the file robots.txt already present locally:

$ wget -N http://google.com/robots.txt
--2014-06-15 21:18:19--  http://google.com/robots.txt
Resolving google.com (google.com)... 173.194.41.8, 173.194.41.9, 173.194.41.14, ...
Connecting to google.com (google.com)|173.194.41.8|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.google.com/robots.txt [following]
--2014-06-15 21:18:19--  http://www.google.com/robots.txt
Resolving www.google.com (www.google.com)... 173.194.46.82, 173.194.46.83, 173.194.46.84, ...
Connecting to www.google.com (www.google.com)|173.194.46.82|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Server file no newer than local file ‘robots.txt’ -- not retrieving.

Notice that the second time, wget did not retrieve the file again.
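Applied to the archiving command from the question, -N replaces --no-clobber rather than being added alongside it: as far as I know, wget refuses to combine the two ("Can't timestamp and not clobber old files at the same time"). The URL and --recursive are placeholders/assumptions as before:

```shell
# Hypothetical adaptation of the question's command: -N replaces
# --no-clobber, since wget does not allow timestamping and no-clobber
# together. Unchanged page requisites should now be skipped on re-runs.
wget --recursive \
     --page-requisites \
     --span-hosts \
     --convert-links \
     -N \
     http://forum.example.com/
```

One caveat to be aware of: timestamp comparison relies on the server sending a Last-Modified header, and --convert-links rewrites the saved pages after download, so the HTML pages themselves may still be re-fetched on later runs even when the skins, scripts and icons are not.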
