How to decompress gzipped HTTP response


File req contains the request header:

GET /cd/E11882_01/server.112/e41084/toc.htm HTTP/1.1^M
Host: docs.oracle.com^M
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8^M
Accept-Language: en-US,en;q=0.5^M
Accept-Encoding: gzip, deflate^M
Connection: keep-alive^M
^M

I run:

cat req | nc docs.oracle.com 80 > resp

resp contains:

HTTP/1.1 200 OK^M
Server: Apache^M
ETag: "726bf43b293f9fc8eac0f8f6b7be3a84:1457459134"^M
Last-Modified: Fri, 04 Mar 2016 14:26:34 GMT^M
Accept-Ranges: bytes^M
Content-Type: text/html^M
Vary: Accept-Encoding^M
Content-Encoding: gzip^M
Date: Sat, 18 Jun 2016 07:04:06 GMT^M
Content-Length: 13163^M
Connection: keep-alive^M
^M
^_<8b>^H^@^@^@^@^@^@^@Å}ysã8<92>ïÿó)¸Þ<88>}3ïµËâMÎvy<83>â%ªuµ(Õ1^[^[
Z¢mvÉ<92>[Gu¹?ýf<82>^D^HÉ¦  Òîx^[³]¶¬ü^AH$^R<99><89>Dâç^?óÆîìëÄ<97>î^O^Oëë¿ý<8c>ÿHëds÷ñ"Ý\à^Gi²<82>^?^^ÒC^Bß9<^¦¿^_³ï^_/¾\Î<9d>Kwûð<98>^\²<9b>uz!-·<9b>Cº9|¼<88>ü<8f>éê.½ T<9b>ä!ýxñ=KÿxÜî^NÜ^WÿÈV<87>û<8f>«ô{¶L/É/?IÙ&;dÉúr¿LÖéGùCç'é!ù<91>=^\^_èG^Lwy<9f>ìö)à^\^O·<97>^V~|È^NëôÚK^NÉM²O¥ø×<81>4<80>¡^\<93>»T<9a>¦·é.Ý,SéRró^^ì^?¾Ê)N:z<97>nÒ]rØî¸<9e><8e>wÉr<9d>J<9e>3íJ_z³á^@!¾§»Cº<93>þ>Ü®Ré£´Ú.<8f>^Oðí^?@^CÃtw<97>®¤Oén<9f>m7<92>Ü1õ^Kéê´<9d>Õ^R¨^_ö^_<96>»49¤+®5¥#^[<97>^]ù²£Ïô^?jÆ?^Uë_Ï¨wÛ<9b>íaÏ^Q%ëue^Sd<94>Üwk8T<89><93><80><»ÍR<9e>7¾&w,í²£U<93>í^KF<8c>o9:h{^Zä4ëlóMÚ¥køð<90> <88>ÜïÒÛ<8f>^W^_>\Áÿ²Í*ýñ^AäòB"ãøxÑÛ>@^_^OO<8f>ðó!ýq¸B¡=Gr·<8f>O»ìîþ^LmµÜ><l7<84>äj    _9Aæ<88>^<82>ÿÛÏûå.{<^T^?L^^^_×Ù^Rä^_ð~K¾'ù^_/$i¿[<9e>·÷Ûþ

   ...continues...

Now, apparently the response body is in gzip format. To decompress it, I copied the response body to resp-body, so resp-body contains the same bytes as the body shown above, starting with ^_<8b>^H.

Then I tried gzip -d resp-body, but it did not work.

What should I do in order to decompress the response?

Best Answer

Delete the headers, and what you have left is gzip-compressed data that can be decompressed with gzip -d or zcat. For example:

sed -e '1,/^[[:space:]]*$/d' resp | gzip -d > resp.decompressed

The sed script deletes the headers - i.e. everything from the first line to the first empty line (/^[[:space:]]*$/).

The [[:space:]] character class makes the sed script match empty lines as well as lines containing only whitespace characters (including carriage returns, ^M).
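
Before piping the result into gzip -d, it can help to confirm that the stripped body really starts with the gzip magic bytes 1f 8b (displayed as ^_<8b> in the dump in the question). A minimal sketch, assuming a POSIX od is available; the function name is_gzip is made up for illustration:

```shell
# is_gzip FILE: succeed if FILE starts with the gzip magic bytes 1f 8b.
# (is_gzip is a hypothetical helper name, not a standard tool.)
is_gzip() {
    # od -An drops the offset column, -tx1 prints hex bytes, -N2 reads 2 bytes.
    magic=$(od -An -tx1 -N2 "$1" | tr -d '[:space:]')
    [ "$magic" = "1f8b" ]
}

# Example: strip the headers as above, then check before decompressing.
# sed -e '1,/^[[:space:]]*$/d' resp > resp-body
# is_gzip resp-body && gzip -d < resp-body > resp.decompressed
```

If the check fails, the body was probably altered while it was being copied out of the response, for example by an editor that re-encoded the binary data.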

BTW, a slightly smarter version of this would extract the Content-Encoding: and Content-Type: headers and use them to decide whether to run cat, lynx -dump, gzip -d, bzip2 -d, xz -d or whatever else is needed to "decode" the data. But that would probably require writing it in perl.
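
For the simplest cases, a plain shell sketch may be enough. The version below only looks at Content-Encoding and only handles gzip and unencoded bodies. It assumes the response does not use chunked transfer encoding (the one in the question has a Content-Length header), and the function name decode_response is made up for illustration. It finds the body by byte offset, so the binary data never passes through sed's output:

```shell
# decode_response FILE: write the decoded body of a saved HTTP response to stdout.
# (decode_response is a hypothetical helper; assumes no chunked transfer encoding.)
decode_response() {
    resp=$1

    # The headers are plain text, so sed can print them safely; it quits at the
    # first blank line (which may hold a lone carriage return).
    hdr_len=$(LC_ALL=C sed -e '/^[[:space:]]*$/q' "$resp" | wc -c)

    # Look up Content-Encoding case-insensitively; empty if the header is absent.
    encoding=$(LC_ALL=C sed -e '/^[[:space:]]*$/q' "$resp" | tr -d '\r' |
        awk -F': *' 'tolower($1) == "content-encoding" { print tolower($2) }')

    # The body starts at the byte after the headers; tail -c +N is 1-based.
    case $encoding in
        gzip|x-gzip) tail -c +"$((hdr_len + 1))" "$resp" | gzip -dc ;;
        ''|identity) tail -c +"$((hdr_len + 1))" "$resp" ;;
        *) echo "unsupported Content-Encoding: $encoding" >&2; return 1 ;;
    esac
}
```

With the files from the question, this would be run as decode_response resp > resp.decompressed. Other encodings could be added as further case branches.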

Related Solutions

Gzip – How to Decompress File In Place

Would there be a way of decompressing it 'while deleting it'?

This is what you asked for. But it may not be what you really want. Use at your own risk.

If the 420GB file is stored on a filesystem with sparse file and punch hole support (e.g. ext4, xfs, but not ntfs), it would be possible to read a file and free the read blocks using fallocate --punch-hole. However, if the process is cancelled for any reason, there may be no way to recover since all that's left is a half-deleted, half-uncompressed file. Don't attempt it without making another copy of the source file first.

Very rough proof of concept:

# dd if=/dev/urandom bs=1M count=6000 | pigz --fast > urandom.img.gz
6000+0 records in
6000+0 records out
6291456000 bytes (6.3 GB, 5.9 GiB) copied, 52.2806 s, 120 MB/s
# df -h urandom.img.gz 
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           7.9G  6.0G  2.0G  76% /dev/shm

urandom.img.gz file occupies 76% of available space, so it can't be uncompressed directly. Pipe uncompressed result to md5sum so we can verify later:

# gunzip < urandom.img.gz | md5sum
bc5ed6284fd2d2161296363edaea5a6d  -

Uncompress while punching holes (this is very rough, without any error checking whatsoever):

total=$(stat --format='%s' urandom.img.gz) # bytes
total=$((1+$total/1024/1024)) # MiB
for ((offset=0; offset < $total; offset++))
do
    # read block
    dd bs=1M skip=$offset count=1 if=urandom.img.gz 2> /dev/null
    # delete (punch-hole) blocks we read
    fallocate --punch-hole --offset="$offset"MiB --length=1MiB urandom.img.gz
done | gunzip > urandom.img

Result:

# ls -alh *
-rw-r--r-- 1 root root 5.9G Jan 31 15:14 urandom.img
-rw-r--r-- 1 root root 5.9G Jan 31 15:14 urandom.img.gz
# du -hcs *
5.9G    urandom.img
0       urandom.img.gz
5.9G    total
# md5sum urandom.img
bc5ed6284fd2d2161296363edaea5a6d  urandom.img

The checksum matches, and the disk usage of the source file dropped from 6 GB to 0 while it was being uncompressed in place (its apparent size, as reported by ls, is unchanged).

But there are so many things that can go wrong that it is better not to do this at all. The loop above does not guarantee that the data was read and processed before it gets deleted: if dd or gunzip returns an error for any reason, fallocate still happily discards the blocks. If you really must use this approach, write a saner read-and-eat program that does proper error checking.

How to decompress only a portion of a file

You could decompress to standard output and feed it through something like head to only capture a bit of it:

gunzip -c file.gz | head -c 20M >file.part

The size suffix in -c 20M requires the head implementation provided by GNU coreutils; other implementations may accept only a plain byte count, if they support -c at all.

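dd may also be used. When dd reads from a pipe it can get short reads, so GNU dd's iflag=fullblock is needed to make count=20 mean twenty full 1 MiB blocks:

gunzip -c file.gz | dd of=file.part bs=1M count=20 iflag=fullblock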

Both of these pipelines will copy the first 20 MiB of the uncompressed file to file.part.
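
The same idea extends to a portion that does not start at the beginning. A minimal sketch, assuming a head that accepts -c with a plain byte count; the function name gz_slice and its arguments are made up for illustration:

```shell
# gz_slice FILE OFFSET LENGTH: print LENGTH bytes of the uncompressed data,
# starting OFFSET bytes in. (gz_slice is a hypothetical helper name.)
gz_slice() {
    # tail -c +N starts output at byte N (1-based), so this skips OFFSET bytes;
    # head -c then stops after LENGTH bytes.
    gunzip -c "$1" | tail -c +"$(($2 + 1))" | head -c "$3"
}

# Example: 4096 bytes starting 1 MiB into the uncompressed data.
# gz_slice file.gz 1048576 4096 > file.part
```

A gzip stream cannot be seeked, so everything before the offset is still decompressed and thrown away; only the output is limited.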
