Shell – Compress a large number of large files fast

bzip2, optimization, shell-script, tar

I have about 200 GB of log data generated daily, distributed among about 150 different log files.

I have a script that moves the files to a temporary location and creates a tar.bz2 archive of that temporary directory.
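
Simplified (the source path below is just a placeholder), the script boils down to:

# Move the day's logs to the staging directory on the NAS, then archive them.
mv /path/to/source/*.log /nwk_storelogs/temp/
tar cjvf /nwk_storelogs/compressed_logs/compressed_logs_$(date +%Y_%d_%m).tar.bz2 /nwk_storelogs/temp/*.log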

The compression ratio is good: the 200 GB of logs compress down to about 12-15 GB.

The problem is that the compression takes forever. The cron job starts at 2:30 AM daily and keeps running until 5:00-6:00 PM.

Is there a way to speed up the compression and finish the job sooner? Any ideas?

Don't worry about other processes: the compression happens on a NAS, and I can mount the NAS on a dedicated VM and run the compression script from there.

Here is the output of top for reference:

top - 15:53:50 up 1093 days,  6:36,  1 user,  load average: 1.00, 1.05, 1.07
Tasks: 101 total,   3 running,  98 sleeping,   0 stopped,   0 zombie
Cpu(s): 25.1%us,  0.7%sy,  0.0%ni, 74.1%id,  0.0%wa,  0.0%hi,  0.1%si,  0.1%st
Mem:   8388608k total,  8334844k used,    53764k free,     9800k buffers
Swap: 12550136k total,      488k used, 12549648k free,  4936168k cached
 PID  USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7086 appmon    18   0 13256 7880  440 R 96.7  0.1 791:16.83 bzip2
7085  appmon    18   0 19452 1148  856 S  0.0  0.0   1:45.41 tar cjvf /nwk_storelogs/compressed_logs/compressed_logs_2016_30_04.tar.bz2 /nwk_storelogs/temp/ASPEN-GC-32459:nkp-aspn-1014.log /nwk_stor
30756 appmon    15   0 85952 1944 1000 S  0.0  0.0   0:00.00 sshd: appmon@pts/0
30757 appmon    15   0 64884 1816 1032 S  0.0  0.0   0:00.01 -tcsh

Best Answer

The first step is to figure out what the bottleneck is: is it disk I/O, network I/O, or CPU?
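
A rough way to check (iostat comes from the sysstat package; sample.log stands in for any one of your log files):

# Watch CPU use vs. I/O wait while the job runs: bzip2 pinned near 100% CPU
# points at the CPU, a high %iowait points at the disks or the NAS.
iostat -x 5

# Time compressing one file read from the NAS, then time a plain read of the
# same file to see how fast the data can be delivered at all.
time bzip2 -c /nwk_storelogs/temp/sample.log > /dev/null
time cat /nwk_storelogs/temp/sample.log > /dev/null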

If the bottleneck is disk I/O, there isn't much you can do. Make sure the disks aren't hit with too many parallel requests, as that can only decrease performance.

If the bottleneck is the network I/O, run the compression process on the machine where the files are stored: running it on a machine with a beefier CPU only helps if the CPU is the bottleneck.

If the bottleneck is the CPU, then the first thing to consider is using a faster compression algorithm. Bzip2 isn't necessarily a bad choice (its main weakness is decompression speed), but you could use gzip and sacrifice some size for compression speed, or try out other formats such as lzop or lzma.

You can also tune the compression level: bzip2 defaults to -9 (maximum block size, so maximum compression, but also the longest compression time); setting the environment variable BZIP2 to a value like -3 makes it use compression level 3 instead.

This thread and this thread discuss common compression algorithms; in particular, the blog post cited by derobert gives benchmarks which suggest that gzip -9 or bzip2 at a low level may be a good compromise compared to bzip2 -9. This other benchmark, which also includes lzma (the algorithm behind 7zip, so you might use 7z instead of tar --lzma), suggests that lzma at a low level can reach the bzip2 compression ratio faster. Just about any choice other than bzip2 will improve decompression time.

Keep in mind that the compression ratio depends on the data, and that compression speed depends on the version of the compression program, on how it was compiled, and on the CPU it runs on.
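
For example (archive names are illustrative; xz has to be installed if it isn't already):

# bzip2 reads the BZIP2 environment variable, so a lower level can be passed
# through tar's -j option without changing anything else:
BZIP2=-3 tar cjf /nwk_storelogs/compressed_logs/logs.tar.bz2 /nwk_storelogs/temp/

# Piping through the compressor explicitly makes it easy to swap tools and levels:
tar cf - /nwk_storelogs/temp/ | gzip -3 > /nwk_storelogs/compressed_logs/logs.tar.gz
tar cf - /nwk_storelogs/temp/ | xz -2 > /nwk_storelogs/compressed_logs/logs.tar.xz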

Another option, if the bottleneck is the CPU and you have multiple cores, is to parallelize the compression. There are two ways to do that. One, which works with any compression algorithm, is to compress the files separately (either individually or in a few groups) and use GNU parallel to run the archiving/compression commands concurrently. This may reduce the compression ratio a little, but it speeds up retrieval of an individual file and works with any tool. The other approach is to use a parallel implementation of the compression tool; this thread lists several.
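
For example (GNU parallel, pbzip2 and pigz have to be installed; adjust -j to the number of cores you can spare; archive names are illustrative):

# Approach 1: compress each log file on its own, several at a time:
parallel -j 4 bzip2 -3 {} ::: /nwk_storelogs/temp/*.log

# Approach 2: keep a single archive and hand the compression to a
# multi-threaded implementation via tar's --use-compress-program (-I):
tar -I pbzip2 -cf /nwk_storelogs/compressed_logs/logs.tar.bz2 /nwk_storelogs/temp/
tar -I pigz -cf /nwk_storelogs/compressed_logs/logs.tar.gz /nwk_storelogs/temp/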
