Wget to get all the files in a directory only returns index.html

command-line, wget

I'm new to using bash, and I have been trying to wget all the files from a website to the server I have been working on. However, all I'm getting back is an index.html file. I let it run for 15 minutes and the index.html file was still downloading, so I killed it. Might my files be downloaded after the index.html file?

Here is the code I have been trying:

$ wget --no-parent -R index.html -A "Sample" -nd --random-wait \
   -r -p -e robots=off -U Mozilla --no-check-certificate \
   http://somewebsite.com/hasSamples/Sample0

I'm trying to download all the files in a subdirectory that starts with Sample. I have searched quite a bit on the internet to find a resolution, and at this point I'm stumped. I probably just haven't found the right combinations of options, but any help would be much appreciated. Here is my understanding of the code:

  • --no-parent means don't search parent directories
  • -R index.html means reject downloading the index.html file; I also tried "index.html*", but it still downloaded it anyway
  • -A "Sample" kind of acts like a Sample* would in bash
  • -nd means download the files and not any of the directories
  • --random-wait to make sure you don't get blacklisted from a site
  • -r recursively downloads
  • -p not sure really
  • -e robots=off ignores robots.txt files
  • -U Mozilla makes the user agent look like it's Mozilla, I think
  • The --no-check-certificate is just necessary for the website.
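To make the -A point concrete, here is the kind of bash match I was expecting -A "Sample" to reproduce (just an illustration in a local directory):

$ # list everything whose name begins with "Sample" (prefix match)
$ ls Sample*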

Best Answer

-A "Sample" kind of acts like a Sample* would in bash

Not by my reading of man wget:

  • -A acclist --accept acclist
  • -R rejlist --reject rejlist

Specify comma-separated lists of file name suffixes or patterns to accept or reject. Note that if any of the wildcard characters, *, ?, [ or ], appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix.

So your usage, with no wildcards, is treated as a suffix -- the equivalent of the bash glob *Sample, not Sample*.
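In other words, a bare string is a suffix match, and you need a wildcard to get the prefix match you are after. A rough sketch against your placeholder URL (other options trimmed for brevity):

$ # no wildcard: "Sample" is treated as a suffix, so it matches e.g. fooSample
$ # but NOT names like Sample0.dat or Sample1.txt
$ wget -r -np -nd -A "Sample" http://somewebsite.com/hasSamples/Sample0/
$ # with a wildcard it becomes a glob pattern, matching names that start with Sample
$ wget -r -np -nd -A "Sample*" http://somewebsite.com/hasSamples/Sample0/

Note also that wget generally has to fetch HTML pages such as index.html anyway in order to find the links to recurse into, even when they match -R (depending on the version it may delete them after parsing), which is likely why your reject rule seemed to be ignored.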

Wget works by scanning links, which is probably why it is trying to download an index.html (you haven't said what its content is, if any, just that it took a long time to download) -- it has to have somewhere to start.

To explain further: a URL is not a file path. You cannot scan a web server as if it were a directory hierarchy, saying, "give me all the files in directory foobar". If foobar corresponds to a real directory (it certainly doesn't have to, because it's part of a URL, not a file path), a web server may be configured to provide an autogenerated index.html listing the files, providing the illusion that you can browse the filesystem. But that's not part of the HTTP protocol; it's just a convention used by default with servers like Apache.

So what wget does is scan, e.g., index.html for <a href= and <img src=, etc., then it follows those links and does the same thing, recursively. That's what wget's "recursive" behaviour refers to -- it recursively scans links because (to reiterate) it does not have access to any filesystem on the server, and the server does not have to provide it with ANY information about one.
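If you want to see what wget actually has to work with, you can dump that starting page and pull the links out yourself; something along these lines (the URL is your placeholder, and the grep is only a rough stand-in for wget's real HTML parsing):

$ # print the starting page to stdout without saving it, then list the href targets
$ wget -q -O - http://somewebsite.com/hasSamples/Sample0/ | grep -o 'href="[^"]*"'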

If you have an actual .html web page that you can load and click through to all the things you want, start with that address, and use just -r -np -k -p.
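For example, assuming http://somewebsite.com/hasSamples/ (or whatever page you can actually click through) is such a page, a minimal version would be:

$ # -r recurse, -np don't ascend to the parent directory,
$ # -k convert links for local viewing, -p fetch page requisites (images, CSS)
$ wget -r -np -k -p http://somewebsite.com/hasSamples/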
