Ubuntu – Wget batch of files fails, curl works, what am I doing wrong?

command-line, wget

I am trying to download the entire directory from this website: https://data.geobasis-bb.de/geobasis/daten/dgm/xyz/

What I tried is:

wget --show-progress -A 'dgm_*.zip' https://data.geobasis-bb.de/geobasis/daten/dgm/xyz/ -P /run/media/usr1/exthdd/dgm

What it should do, as far as I understand it, is download all files matching the name schema dgm_*.zip. However, it returns only:

--2020-01-13 14:50:11--  https://data.geobasis-bb.de/geobasis/daten/dgm/xyz/
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving data.geobasis-bb.de (data.geobasis-bb.de)… 194.99.76.18, 194.76.232.112
Connecting to data.geobasis-bb.de (data.geobasis-bb.de)|194.99.76.18|:443… connected.
HTTP request sent, awaiting response… 200 OK
Length: unspecified [text/html]
Saving to: '/run/media/lgoldmann/lg_backup_diss/dgm/index.html.tmp.2'

index.html.tmp.2                             [   <=>                                                                             ]   2.65M  4.69MB/s    in 0.6s    

2020-01-13 14:50:15 (4.69 MB/s) - '/run/media/lgoldmann/lg_backup_diss/dgm/index.html.tmp.2' saved [2778920]

The website also offers a pre-typed command for curl, which works just fine, but I would like to understand what went wrong with my wget command.
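For reference, the curl approach boils down to something like the following (not the site's exact command, which I haven't copied here; the grep pattern is just my guess at how the zip links appear in the index HTML):

curl -s https://data.geobasis-bb.de/geobasis/daten/dgm/xyz/ | grep -oE 'dgm_[^"]+\.zip' | sort -u | while read f; do curl -fO "https://data.geobasis-bb.de/geobasis/daten/dgm/xyz/$f"; done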

Best Answer

You need to use the -r (recursive) option to make wget follow the links on the page; otherwise wget only fetches the single page the web server serves for that URL (i.e. the default or index page) and quits.

When using -r it is wise to also pass -np (no-parent), which makes sure wget does not follow links one or more levels up the directory tree.

Also, you probably do not want wget to rebuild the site's directory structure locally but just download the files into one place, so add the -nd (no-directories) option as well, like so:

wget --show-progress -A 'dgm_*.zip' -r -np -nd https://data.geobasis-bb.de/geobasis/daten/dgm/xyz/ -P /run/media/usr1/exthdd/dgm
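Note that wget still has to download the index page in order to discover the links; with -A it deletes the non-matching HTML files again after parsing them, so only the zip files remain. If all the zip files sit directly in that one directory (which the listing suggests), you can additionally cap the recursion depth with -l 1 so wget does not crawl any deeper:

wget --show-progress -A 'dgm_*.zip' -r -l 1 -np -nd https://data.geobasis-bb.de/geobasis/daten/dgm/xyz/ -P /run/media/usr1/exthdd/dgm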