How to use wget with a list of URLs and their corresponding output files

Tags: download, files, links, wget

Suppose list_of_urls looks like this:

http://www.url1.com/some.txt

I know how to use that with:

wget -i list_of_urls

But what if my list_of_urls has entries like the following, all of which return proper files such as PDFs or videos:

http://www.url1.com/app?q=123&gibb=erish&gar=ble
http://www.url2.com/app?q=111&wha=tcha&mac=allit

For a single file I could do this:

wget -O some.txt "http://www.url1.com/app?q=123&gibb=erish&gar=ble"

How do I use wget to download that list of URLs and save the returned data to the proper local file?

Best Answer

By default, wget writes to a file whose name is the last component of the URL that you pass to it. Many servers redirect URLs like http://www.url1.com/app?q=123&gibb=erish&gar=ble to a different URL with a nice-looking file name like http://download.url1.com/files/something.pdf. You can tell wget to use the name from the redirected URL (i.e. something.pdf) instead of app?q=123&gibb=erish&gar=ble by passing the --trust-server-names option. This isn't the default mode because, if used carelessly, it could lead to overwriting an unpredictable file name in the current directory; but if you trust the server or are working in a directory containing no other precious files, --trust-server-names is usually the right thing to use.

Some servers use a Content-Disposition header instead of redirection to specify a file name. Pass the --content-disposition option to make wget use this file name.

Thus:

wget --content-disposition --trust-server-names -i list_of_urls

If you still aren't getting nice-looking file names, you may want to specify your own. Suppose you have a file containing lines like

http://www.url1.com/app?q=123&gibb=erish&gar=ble foo.pdf
http://www.url2.com/app?q=111&wha=tcha&mac=allit bar.txt

To make wget download the files to the specified file names, assuming there are no whitespace characters in the URL or in the file names:

err=0
while read -r url filename tail; do
  wget -O "$filename" "$url" || err=1
done <list_of_urls_and_file_names

The err variable contains 0 if all downloads succeeded and 1 otherwise. You can return $err if you put this snippet in a function, or exit $err if you put it in a script.
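As a sketch of the function form, here is the loop above wrapped in a function that returns the error status. To keep the example self-contained and offline, the download step is a stub of my own (named download) that only prints what would be fetched; in real use, replace its body with wget -O "$filename" "$url".

```shell
#!/bin/sh
# Stub standing in for wget, so the example runs without a network.
download() {
  printf 'would fetch %s into %s\n' "$1" "$2"
}

# Read "URL filename" pairs from the file named in $1; return 0 only if
# every download succeeded.
download_all() {
  err=0
  while read -r url filename tail; do
    download "$url" "$filename" || err=1
  done <"$1"
  return "$err"
}

# Usage: build a sample two-column list and run the function.
cat >list_of_urls_and_file_names <<'EOF'
http://www.url1.com/app?q=123&gibb=erish&gar=ble foo.pdf
http://www.url2.com/app?q=111&wha=tcha&mac=allit bar.txt
EOF
download_all list_of_urls_and_file_names
echo "exit status: $?"
```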

If you don't want to specify anything other than the URLs, and you can't get nice names from the server, you can guess the file type and attempt to get at least meaningful extensions.

err=0
n=1
while read -r url; do
  if wget -O tmpfile "$url"; then
    ext=data
    case $(file -b --mime-type tmpfile) in
      application/pdf) ext=pdf;;
      image/jpeg) ext=jpg;;
      text/html) ext=html;;
      text/*) ext=txt;;
    esac
    mv tmpfile "$n.$ext"
  else
    err=1
  fi
  n=$((n+1))
done <list_of_urls
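To check the mapping logic without downloading anything, you can factor the case statement into a function and exercise it directly (a sketch; the function name ext_for_mime is mine):

```shell
#!/bin/sh
# Map a MIME type to a file extension, defaulting to "data".
ext_for_mime() {
  case $1 in
    application/pdf) echo pdf;;
    image/jpeg) echo jpg;;
    text/html) echo html;;
    text/*) echo txt;;
    *) echo data;;
  esac
}

ext_for_mime application/pdf   # pdf
ext_for_mime text/csv          # txt (caught by the text/* pattern)
ext_for_mime video/mp4         # data (no specific match)
```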

Add other types as desired. If your file command doesn't have the --mime-type option, use file -i instead and match patterns like application/pdf* in the case statement, since its output appends a charset (e.g. application/pdf; charset=binary). If you have a file /etc/mime.types on your system, you can read associations of MIME types to extensions from it instead of supplying your own list:

err=0
n=1
while read -r url; do
  if wget -O tmpfile "$url"; then
    mime_type=$(file -b --mime-type tmpfile)
    ext=$(awk -v mt="$mime_type" '$1 == mt { print $2; found = 1; exit }
                                  END { if (!found) print "data" }' /etc/mime.types)
    mv tmpfile "$n.$ext"
  else
    err=1
  fi
  n=$((n+1))
done <list_of_urls
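The awk lookup can be tested in isolation against a small local copy of the mime.types format, before wiring it to wget (a sketch; the file sample_mime.types and the wrapper function lookup are mine). Note the found flag: in awk, exit still triggers the END block, so without the flag a successful match would also print the "data" fallback.

```shell
#!/bin/sh
# A few lines in the /etc/mime.types format: MIME type, then extensions.
cat >sample_mime.types <<'EOF'
application/pdf pdf
image/jpeg jpeg jpg jpe
text/html html htm
EOF

# Print the first extension listed for a MIME type, or "data" if absent.
lookup() {
  awk -v mt="$1" '$1 == mt { print $2; found = 1; exit }
                  END { if (!found) print "data" }' sample_mime.types
}

lookup application/pdf   # pdf
lookup image/jpeg        # jpeg (the first extension on the line)
lookup video/mp4         # data (not in the table)
```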