I'm not sure which version of wget or OS you have, or whether any proxies exist between you and SourceForge, but wget downloaded the file when I removed the "/download" and left the URL ending at the file extension. I don't want to flood the post or pastebin my entire session, but I got the 302 then 200 status codes before the transfer began. What happens when you try wget with the direct URL?
Resolving downloads.sourceforge.net... 216.34.181.59
Connecting to downloads.sourceforge.net|216.34.181.59|:80... connected.
HTTP request sent, awaiting response... 302 Found
[snipped for brevity]
HTTP request sent, awaiting response... 200 OK
Length: 13432789 (13M) [application/x-gzip]
Saving to: `download'
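For illustration, the shape of URL that worked (the project and file names here are placeholders; the point is to end the URL at the archive itself rather than at /download):

# PROJECT and FILE are stand-ins for the real SourceForge project and archive
wget 'http://downloads.sourceforge.net/project/PROJECT/FILE.tar.gz'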
-A "Sample" kind of acts like a Sample* would in bash
Not by my reading of man wget
:
-A acclist --accept acclist
-R rejlist --reject rejlist
    Specify comma-separated lists of file name suffixes or patterns to accept or reject. Note that if any of the wildcard characters, *, ?, [ or ], appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix.
So your usage (no wildcards) is treated as a suffix, which is equivalent to the bash glob *Sample.
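For instance (the host and path are hypothetical), these two invocations filter identically, because a bare entry is matched as a trailing suffix while the starred form is matched as a glob pattern:

# example.com/data/ is a stand-in for the real site
wget -r -np -A "Sample" http://example.com/data/
wget -r -np -A "*Sample" http://example.com/data/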
Wget works by scanning links, which is probably why it is trying to download an index.html (you haven't said what the content of that is, if any, just that it took a long time) -- it has to have somewhere to start. To explain further: a URL is not a file path. You cannot scan a web server as if it were a directory hierarchy, saying, "give me all the files in directory foobar". If foobar corresponds to a real directory (it certainly doesn't have to, because it's part of a URL, not a file path), a web server may be configured to provide an autogenerated index.html listing the files, providing the illusion that you can browse the filesystem. But that's not part of the HTTP protocol, it's just a convention used by default with servers like Apache. So what wget does is scan, e.g., index.html for <a href= and <img src=, etc., then it follows those links and does the same thing, recursively. That's what wget's "recursive" behaviour refers to -- it recursively scans links because (to reiterate) it does not have access to any filesystem on the server, and the server does not have to provide it with ANY information about one.
If you have an actual .html web page that you can load and click through to all the things you want, start with that address, and use just -r -np -k -p.
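For example, a minimal sketch (the address is a placeholder):

# -r follows links recursively, -np refuses to ascend above the starting
# directory, -k converts links in saved pages so they work locally, and
# -p also fetches page requisites such as images and stylesheets
wget -r -np -k -p http://example.com/docs/index.html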
Best Answer
I think your ? gets interpreted by the shell (Correction by vinc17: more likely, it's the & which gets interpreted). Just try with single quotes around your URL:
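Something along these lines (the host and path are assumed from the GEO accession in the query string; substitute the exact URL from the question):

# host/path assumed; use the exact URL from the question
wget 'https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE48191&format=file'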
Note that the file you are requesting is a .tar file, but the above command will save it as index.html?acc=GSE48191&format=file. To have it correctly named, you can either rename it to .tar:
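For example (the quotes again protect the ? and & from the shell):

mv 'index.html?acc=GSE48191&format=file' GSE48191.tar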
Or you can give the name as an option to wget:
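Again assuming the same URL as above:

# -O names the output file directly
wget -O GSE48191.tar 'https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE48191&format=file'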
The above command will save the downloaded file as GSE48191.tar directly.