I'm new to bash, and I have been trying to use wget to download
all the files from a website to the server I'm working on. However, all I'm getting back is an index.html
file. I let it run for 15 minutes and the index.html file was still downloading, so I killed it. Might my files be downloaded after the index.html
file?
Here is the code I have been trying:
$ wget --no-parent -R index.html -A "Sample" -nd --random-wait \
-r -p -e robots=off -U Mozilla --no-check-certificate \
http://somewebsite.com/hasSamples/Sample0
I'm trying to download all the files in a subdirectory whose names start with Sample. I have searched quite a bit on the internet to find a resolution, and at this point I'm stumped. I probably just haven't found the right combination of options, but any help would be much appreciated. Here is my understanding of the code:
--no-parent means don't search parent directories.
-R index.html means reject downloading the index.html file. I also tried "index.html*", but it still downloaded it anyway.
-A "Sample" kind of acts like Sample* would in bash.
-nd means download the files and not any of the directories.
--random-wait is to make sure you don't get blacklisted from a site.
-r recursively downloads.
-p: not sure, really.
-e robots=off ignores robots.txt files.
-U Mozilla makes the user agent look like Mozilla, I think.
--no-check-certificate is just necessary for this website.
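The -A "Sample" comparison to a bash glob above is close but not exact: per man wget, an accept value without wildcard characters is treated as a filename suffix, while a value containing wildcards is matched as a pattern against the whole name. A small bash sketch of that difference, using bash's own case patterns (the file names here are invented for illustration):

```shell
matches_suffix() {
  # Like wget -A Sample (no wildcards): accepted only if the name ENDS in "Sample"
  case "$1" in *Sample) echo yes ;; *) echo no ;; esac
}
matches_pattern() {
  # Like wget -A "Sample*" (has a wildcard): matched against the whole name
  case "$1" in Sample*) echo yes ;; *) echo no ;; esac
}

matches_suffix  "Sample01.tar"   # no: it does not end in "Sample"
matches_pattern "Sample01.tar"   # yes: it starts with "Sample"
matches_suffix  "fooSample"      # yes: it ends in "Sample"
```

So to accept files whose names merely start with Sample, the -A value needs a wildcard, e.g. -A "Sample*" (quoted so the shell doesn't expand it first).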
Best Answer
Not by my reading of man wget: an -A entry that contains no wildcard characters is treated as a filename suffix rather than a pattern, so your usage (no wildcards) is equivalent to the bash glob *.Sample.

Wget works by scanning links, which is probably why it is trying to download an index.html (you haven't said what the content of that is, if any, just that it took a long time): it has to have somewhere to start. To explain further: a URL is not a file path. You cannot scan a web server as if it were a directory hierarchy, saying "give me all the files in directory foobar". If foobar corresponds to a real directory (it certainly doesn't have to, because it's part of a URL, not a file path), a web server may be configured to provide an autogenerated index.html listing the files, providing the illusion that you can browse the filesystem. But that's not part of the HTTP protocol; it's just a convention used by default with servers like Apache.

So what wget does is scan, e.g., index.html for <a href= and <img src=, etc., then follow those links and do the same thing, recursively. That's what wget's "recursive" behaviour refers to: it recursively scans links because (to reiterate) it does not have access to any filesystem on the server, and the server does not have to provide it with ANY information about one.

If you have an actual .html web page that you can load and click through to all the things you want, start with that address and use just -r -np -k -p.
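A minimal sketch of that suggestion, assuming there is some HTML page you can start from (the host below is the question's placeholder, not a real site):

```shell
# Sketch only: somewebsite.com is the placeholder from the question.
# -r  recurse through the links wget finds
# -np never ascend to the parent directory
# -k  convert links in the saved pages so they work locally
# -p  also fetch each page's requisites (images, CSS, etc.)
wget -r -np -k -p "http://somewebsite.com/hasSamples/"
```

This can still be combined with -A "Sample*" to keep only the files you want once the links have been discovered.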