How to use wget with a list of URLs and their corresponding output files

Tags: download, files, links, wget

Suppose list_of_urls looks like this:

http://www.url1.com/some.txt

I know how to use that with:

wget -i list_of_urls

But what if my list_of_urls has entries like the following, all of which return proper files such as PDFs or videos:

http://www.url1.com/app?q=123&gibb=erish&gar=ble
http://www.url2.com/app?q=111&wha=tcha&mac=allit

For a single file I could do this:

wget -O some.txt "http://www.url1.com/app?q=123&gibb=erish&gar=ble"

How do I use wget to download that list of URLs and save the returned data to the proper local file?

Best Answer

By default, wget writes to a file whose name is the last component of the URL that you pass to it. Many servers redirect URLs like http://www.url1.com/app?q=123&gibb=erish&gar=ble to a different URL with a nice-looking file name like http://download.url1.com/files/something.pdf. You can tell wget to use the name from the redirected URL (i.e. something.pdf) instead of app?q=123&gibb=erish&gar=ble by passing the --trust-server-names option. This isn't the default mode because, if used carelessly, it could lead to overwriting an unpredictable file name in the current directory; but if you trust the server or are working in a directory containing no other precious files, --trust-server-names is usually the right thing to use.

Some servers use a Content-Disposition header instead of redirection to specify a file name. Pass the --content-disposition option to make wget use this file name.

Thus:

wget --content-disposition --trust-server-names -i list_of_urls

If you still aren't getting nice-looking file names, you may want to specify your own. Suppose you have a file containing lines like

http://www.url1.com/app?q=123&gibb=erish&gar=ble foo.pdf
http://www.url2.com/app?q=111&wha=tcha&mac=allit bar.txt

To make wget download the files to the specified file names, assuming there are no whitespace characters in the URL or in the file names:

err=0
while read -r url filename tail; do
  wget -O "$filename" "$url" || err=1
done <list_of_urls_and_file_names

The err variable contains 0 if all downloads succeeded and 1 otherwise. You can return $err if you put this snippet in a function, or exit $err if you put it in a script.
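As a sketch of the function form, here is the loop above wrapped in a function that returns the error status. To keep the example self-contained and offline, the download step is a stub of my own (named download) that only prints what would be fetched; in real use, replace its body with wget -O "$filename" "$url".

```shell
#!/bin/sh
# Stub standing in for wget, so the example runs without a network.
download() {
  printf 'would fetch %s into %s\n' "$1" "$2"
}

# Read "URL filename" pairs from the file named in $1; return 0 only if
# every download succeeded.
download_all() {
  err=0
  while read -r url filename tail; do
    download "$url" "$filename" || err=1
  done <"$1"
  return "$err"
}

# Usage: build a sample two-column list and run the function.
cat >list_of_urls_and_file_names <<'EOF'
http://www.url1.com/app?q=123&gibb=erish&gar=ble foo.pdf
http://www.url2.com/app?q=111&wha=tcha&mac=allit bar.txt
EOF
download_all list_of_urls_and_file_names
echo "exit status: $?"
```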

If you don't want to specify anything other than the URLs, and you can't get nice names from the server, you can guess the file type and attempt to get at least meaningful extensions.

err=0
n=1
while read -r url; do
  if wget -O tmpfile "$url"; then
    ext=data
    case $(file -b --mime-type tmpfile) in
      application/pdf) ext=pdf;;
      image/jpeg) ext=jpg;;
      text/html) ext=html;;
      text/*) ext=txt;;
    esac
    mv tmpfile "$n.$ext"
  else
    err=1
  fi
  n=$((n+1))
done <list_of_urls
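To check the mapping logic without downloading anything, you can factor the case statement into a function and exercise it directly (a sketch; the function name ext_for_mime is mine):

```shell
#!/bin/sh
# Map a MIME type to a file extension, defaulting to "data".
ext_for_mime() {
  case $1 in
    application/pdf) echo pdf;;
    image/jpeg) echo jpg;;
    text/html) echo html;;
    text/*) echo txt;;
    *) echo data;;
  esac
}

ext_for_mime application/pdf   # pdf
ext_for_mime text/csv          # txt (caught by the text/* pattern)
ext_for_mime video/mp4         # data (no specific match)
```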

Add other types as desired. If your file command doesn't have the --mime-type option, use file -i instead and match patterns like application/pdf* in the case statement, since its output appends a charset (e.g. application/pdf; charset=binary). If you have a file /etc/mime.types on your system, you can read associations of MIME types to extensions from it instead of supplying your own list:

err=0
n=1
while read -r url; do
  if wget -O tmpfile "$url"; then
    mime_type=$(file -b --mime-type tmpfile)
    ext=$(awk -v mt="$mime_type" '$1 == mt { print $2; found = 1; exit }
                                  END { if (!found) print "data" }' /etc/mime.types)
    mv tmpfile "$n.$ext"
  else
    err=1
  fi
  n=$((n+1))
done <list_of_urls
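The awk lookup can be tested in isolation against a small local copy of the mime.types format, before wiring it to wget (a sketch; the file sample_mime.types and the wrapper function lookup are mine). Note the found flag: in awk, exit still triggers the END block, so without the flag a successful match would also print the "data" fallback.

```shell
#!/bin/sh
# A few lines in the /etc/mime.types format: MIME type, then extensions.
cat >sample_mime.types <<'EOF'
application/pdf pdf
image/jpeg jpeg jpg jpe
text/html html htm
EOF

# Print the first extension listed for a MIME type, or "data" if absent.
lookup() {
  awk -v mt="$1" '$1 == mt { print $2; found = 1; exit }
                  END { if (!found) print "data" }' sample_mime.types
}

lookup application/pdf   # pdf
lookup image/jpeg        # jpeg (the first extension on the line)
lookup video/mp4         # data (not in the table)
```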