Ignore “other” domains when downloading with wget

command-line wget

I would like to crawl the links under www.website.com/XYZ and download only the links that are under www.website.com/ABC.

I am using the following wget command to get the files I want:

wget  -I ABC -r -e robots=off --wait 0.25  http://www.website.com/XYZ

This works perfectly with wget 1.13.4. However, I have to run the command on a server that has wget 1.11, and there the same command also downloads from additional domains, such as:

www.website.de 
www.website.it 
...

How can I avoid this problem? I tried using

--exclude domains=www.website.de,www.website.it

However, it kept downloading from those domains.

Also note that I can't use --no-parent, since the files I want are not below the starting directory (I want the files under website.com/ABC, found by crawling the links under website.com/XYZ).

Any hints?

Best Answer

This is wrong:

--exclude domains=www.website.de,www.website.it

The right way is:

--exclude-domains www.website.de,www.website.it

From the wget man page:

--exclude-domains domain-list
      Specify the domains that are not to be followed.
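Putting the corrected option back into the command from the question gives something like the sketch below. The EXCLUDE variable and the print-before-run step are only illustrative additions; the options and URL are the ones from the question, and the domain list contains just the two domains named there.

```shell
# Domains named in the question; extend the comma-separated list as needed.
EXCLUDE="www.website.de,www.website.it"

# Build the command as positional parameters so it can be inspected first.
set -- wget -I ABC -r -e robots=off --wait 0.25 \
    --exclude-domains "$EXCLUDE" \
    http://www.website.com/XYZ

# Print the exact command line that would be run.
printf '%s\n' "$*"

# Uncomment to actually start the crawl:
# "$@"
```

Note that the option is a single word with a hyphen, and the domain list is one comma-separated argument with no spaces.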

Related Solutions

Wget Tool – How to Copy Folders from public.me.com with Wget-Like Tool

That server is clearly running a partial or broken implementation of WebDAV. Note that you need to connect to a URL like https://public.me.com/ix/rudchenko, not the normal URL https://public.me.com/rudchenko. I tried several clients:

With a normal HTTP downloader such as wget or curl, I could download a file knowing its name (e.g. wget https://public.me.com/ix/rudchenko/directory/filename), but was not able to obtain a directory listing.
FuseDAV, which would have been my first choice, is unable to cope with some missing commands. It apparently manages to list the root directory (visible in the output from fusedav -D) but eventually runs some request that returns “PROPFIND failed: 404 Not Found” and locks up.
Nd lacks a list command.
Cadaver works well, but lacks a recursive retrieval command. You could use it to obtain listings, then retrieve individual files as above. It's not perfect, and there is a problem specific to this case: cadaver's mget cannot handle wildcard arguments that expand to filenames containing spaces.
Davfs2 works very well. I could mount that share and copy files from it. The only downside is that it is not a FUSE filesystem, so you need root to mount it, or an entry in /etc/fstab.

The FUSE-based wdfs-1.4.2-alt0.M51.1 worked very well in this case, requiring no root (only permissions for /dev/fuse).

mkdir viewRemote
wdfs https://public.me.com/ix/rudchenko/ viewRemote
rsync -a viewRemote/SEM*TO\ PRINT* ./
fusermount -u viewRemote
rmdir viewRemote

(Of course, a simple cp instead of rsync would work just as well in this example; rsync was chosen only because it reports what differs when the copy is updated later.)

(Apart from wdfs, I tried these commands on a Debian squeeze system. Your mileage may vary.)

Make wget refer to a local copy without redundantly downloading files

I believe that including the -N switch makes wget use timestamps.

   -N
   --timestamping
       Turn on time-stamping.

With this switch, wget downloads a file only if there is no local copy or the server's copy is newer than the local one (or differs from it in size).

Example

Download where file robots.txt doesn't already exist locally.

$ wget -N http://google.com/robots.txt
--2014-06-15 21:18:16--  http://google.com/robots.txt
Resolving google.com (google.com)... 173.194.41.9, 173.194.41.14, 173.194.41.0, ...
Connecting to google.com (google.com)|173.194.41.9|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.google.com/robots.txt [following]
--2014-06-15 21:18:17--  http://www.google.com/robots.txt
Resolving www.google.com (www.google.com)... 173.194.46.83, 173.194.46.84, 173.194.46.80, ...
Connecting to www.google.com (www.google.com)|173.194.46.83|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘robots.txt’

    [ <=>                                                                                                                                 ] 7,608       --.-K/s   in 0s      

2014-06-15 21:18:17 (359 MB/s) - ‘robots.txt’ saved [7608]

Trying it a second time with the file robots.txt locally:

$ wget -N http://google.com/robots.txt
--2014-06-15 21:18:19--  http://google.com/robots.txt
Resolving google.com (google.com)... 173.194.41.8, 173.194.41.9, 173.194.41.14, ...
Connecting to google.com (google.com)|173.194.41.8|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.google.com/robots.txt [following]
--2014-06-15 21:18:19--  http://www.google.com/robots.txt
Resolving www.google.com (www.google.com)... 173.194.46.82, 173.194.46.83, 173.194.46.84, ...
Connecting to www.google.com (www.google.com)|173.194.46.82|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Server file no newer than local file ‘robots.txt’ -- not retrieving.

Notice that the second time, wget did not retrieve the file again.
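The decision behind the "Server file no newer than local file" message can be sketched as a small shell function. This is only an illustration of the rule as I understand it from the wget manual (download when the local file is missing, when the sizes differ, or when the server copy is newer), not wget's actual code. The function name needs_download and its arguments are hypothetical, and stat -c is the GNU coreutils form.

```shell
# needs_download LOCAL_FILE REMOTE_MTIME [REMOTE_SIZE]
# Hypothetical helper: returns 0 (true) if a download looks warranted.
# REMOTE_MTIME is in seconds since the epoch; REMOTE_SIZE may be empty
# when the server does not report a length.
needs_download() {
    local_file=$1
    remote_mtime=$2
    remote_size=$3

    # No local copy yet: download.
    if [ ! -e "$local_file" ]; then
        return 0
    fi

    local_mtime=$(stat -c %Y "$local_file")   # GNU stat: mtime in epoch seconds
    local_size=$(stat -c %s "$local_file")    # GNU stat: size in bytes

    # Sizes differ (only checked when the remote size is known): download.
    if [ -n "$remote_size" ] && [ "$local_size" -ne "$remote_size" ]; then
        return 0
    fi

    # Server copy is newer than the local copy: download.
    if [ "$remote_mtime" -gt "$local_mtime" ]; then
        return 0
    fi

    # Otherwise the local copy is considered up to date.
    return 1
}
```

For example, needs_download robots.txt "$(date +%s)" would report that a download is warranted for any robots.txt last modified before the current second, since the remote timestamp passed in is newer.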
