Linux – Time to zip very large (100G) files

gziplinux

I find myself having to compress a number of very large files (80-ish GB), and I am surprised at the (lack of) speed my system is exhibiting. I get about 500 MB / min conversion speed; using top, I seem to be using a single CPU at approximately 100%.

I am pretty sure it's not (just) disk access speed, since creating a tar file (that's how the 80G file was created) took just a few minutes (maybe 5 or 10), but after more than 2 hours my simple gzip command is still not done.

In summary:

tar -cvf myStuff.tar myDir/*

Took <5 minutes to create an 87 G tar file

gzip myStuff.tar

Took two hours and 10 minutes, creating a 55G zip file.

My question: Is this normal? Are there certain options in gzip to speed things up? Would it be faster to concatenate the commands and use tar -cvfz? I saw reference to pigzParallel Implementation of GZip – but unfortunatly I cannot install software on the machine I am using, so that is not an option for me. See for example this earlier question.

I am intending to try some of these options myself and time them – but it is quite likely that I will not hit "the magic combination" of options. I am hoping that someone on this site knows the right trick to speed things up.

When I have the results of other trials available I will update this question – but if anyone has a particularly good trick available, I would really appreciate it. Maybe the gzip just takes more processing time than I realized…

UPDATE

As promised, I tried the tricks suggsted below: change the amount of compression, and change the destination of the file. I got the following results for a tar that was about 4.1GB:

flag    user      system   size    sameDisk
-1     189.77s    13.64s  2.786G     +7.2s 
-2     197.20s    12.88s  2.776G     +3.4s
-3     207.03s    10.49s  2.739G     +1.2s
-4     223.28s    13.73s  2.735G     +0.9s
-5     237.79s     9.28s  2.704G     -0.4s
-6     271.69s    14.56s  2.700G     +1.4s
-7     307.70s    10.97s  2.699G     +0.9s
-8     528.66s    10.51s  2.698G     -6.3s
-9     722.61s    12.24s  2.698G     -4.0s

So yes, changing the flag from the default -6 to the fastest -1 gives me a 30% speedup, with (for my data) hardly any change to the size of the zip file. Whether I'm using the same disk or another one makes essentially no difference (I would have to run this multiple times to get any statistical significance).

If anyone is interested, I generated these timing benchmarks using the following two scripts:

#!/bin/bash
# compare compression speeds with different options
sameDisk='./'
otherDisk='/tmp/'
sourceDir='/dirToCompress'
logFile='./timerOutput'
rm $logFile

for i in {1..9}
  do  /usr/bin/time -a --output=timerOutput ./compressWith $sourceDir $i $sameDisk $logFile
  do  /usr/bin/time -a --output=timerOutput ./compressWith $sourceDir $i $otherDisk $logFile
done

And the second script (compressWith):

#!/bin/bash
# use: compressWith sourceDir compressionFlag destinationDisk logFile
echo "compressing $1 to $3 with setting $2" >> $4
tar -c $1 | gzip -$2 > $3test-$2.tar.gz

Three things to note:

  1. Using /usr/bin/time rather than time, since the built-in command of bash has many fewer options than the GNU command
  2. I did not bother using the --format option although that would make the log file easier to read
  3. I used a script-in-a-script since time seemed to operate only on the first command in a piped sequence (so I made it look like a single command…).

With all this learnt, my conclusions are

  1. Speed things up with the -1 flag (accepted answer)
  2. Much more time is spend compressing the data than reading from disk
  3. Invest in faster compression software (pigz seems like a good choice).
  4. If you have multiple files to compress you can put each gzip command in its own thread and use more of the available CPU (poor man’s pigz)

Thanks everyone who helped me learn all this!

Best Answer

You can change the speed of gzip using --fast --best or -# where # is a number between 1 and 9 (1 is fastest but less compression, 9 is slowest but more compression). By default gzip runs at level 6.

Related Question