Linux – How to get uncompressed content when using recursive wget

Tags: command-line, compression, linux, wget

I am downloading many single pages with all their static content (JS, CSS, images…) via recursive wget. It turned out that content the server delivered compressed (gzip) is stored by wget in its compressed form, but I want the uncompressed form. I would rather not write another script that walks the directories recursively and tries to decompress whatever it can. So is there any way to get the content uncompressed?

CMD:

wget -E -H -k -K -p https://some.example

Even --header='Accept-Encoding: ' (telling the server not to use gzip) did not help.

Thanks for any advice 🙂

Best Answer

  1. Use httrack instead of wget
  2. Set up a decompression proxy. Squid with some 3rd-party plugin should be able to do that. I'm more familiar with Java, so I used LittleProxy, overrode the method getMaximumResponseBufferSizeInBytes(), and that was it. I wrote about the latter here.
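For completeness, the post-hoc route the question rules out is also fairly short. A hedged sketch (the `mirror` directory name and the `decompress_mirror` helper are illustrative, not from the question): walk the downloaded tree and decompress only the files that really are gzip streams, identified by their magic bytes.

```shell
# Hedged sketch: walk a wget mirror directory and decompress every file
# that is actually a gzip stream, keeping its original file name.
decompress_mirror() {
    find "$1" -type f | while read -r f; do
        # The first two bytes of any gzip stream are 1f 8b.
        if [ "$(head -c 2 "$f" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]; then
            mv "$f" "$f.gz" && gunzip "$f.gz"   # restores "$f", uncompressed
        fi
    done
}

# "mirror" is an illustrative directory name, not from the question.
if [ -d mirror ]; then
    decompress_mirror mirror
fi
```

Checking the magic bytes instead of the file extension matters here, because wget stores the compressed body under the original URL's name with no `.gz` suffix.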

EDIT: Wget 1.19.2 introduced gzip Content-Encoding decompression (and it works).
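For reference, the relevant flag in that release is `--compression`. A sketch, assuming wget >= 1.19.2 built with zlib (`https://some.example` is the question's placeholder host):

```shell
# --compression=auto asks the server for gzip and lets wget transparently
# decompress responses before writing them to disk
# (assumes wget >= 1.19.2 built with zlib support).
wget --compression=auto -E -H -k -K -p https://some.example
```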
