I have about 200 GB of log data generated daily, distributed among about 150 different log files.
I have a script that moves the files to a temporary location and does a tar-bz2 on the temporary directory.
I get good results as 200 GB logs are compressed to about 12-15 GB.
The problem is that it takes forever to compress the files. The cron job runs at 2:30 AM daily and continues to run till 5:00-6:00 PM.
Is there a way to improve the speed of the compression and complete the job faster? Any ideas?
Don't worry about other processes and all, the location where the compression happens is on a NAS, and I can run mount the NAS on a dedicated VM and run the compression script from there.
Here is the output of top for reference:
top - 15:53:50 up 1093 days, 6:36, 1 user, load average: 1.00, 1.05, 1.07
Tasks: 101 total, 3 running, 98 sleeping, 0 stopped, 0 zombie
Cpu(s): 25.1%us, 0.7%sy, 0.0%ni, 74.1%id, 0.0%wa, 0.0%hi, 0.1%si, 0.1%st
Mem: 8388608k total, 8334844k used, 53764k free, 9800k buffers
Swap: 12550136k total, 488k used, 12549648k free, 4936168k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7086 appmon 18 0 13256 7880 440 R 96.7 0.1 791:16.83 bzip2
7085 appmon 18 0 19452 1148 856 S 0.0 0.0 1:45.41 tar cjvf /nwk_storelogs/compressed_logs/compressed_logs_2016_30_04.tar.bz2 /nwk_storelogs/temp/ASPEN-GC-32459:nkp-aspn-1014.log /nwk_stor
30756 appmon 15 0 85952 1944 1000 S 0.0 0.0 0:00.00 sshd: appmon@pts/0
30757 appmon 15 0 64884 1816 1032 S 0.0 0.0 0:00.01 -tcsh
Best Answer
The first step is to figure out what the bottleneck is: is it disk I/O, network I/O, or CPU?
If the bottleneck is the disk I/O, there isn't much you can do. Make sure that the disks don't serve many parallel requests as that can only decrease performance.
If the bottleneck is the network I/O, run the compression process on the machine where the files are stored: running it on a machine with a beefier CPU only helps if the CPU is the bottleneck.
If the bottleneck is the CPU, then the first thing to consider is using a faster compression algorithm. Bzip2 isn't necessarily a bad choice — its main weakness is decompression speed — but you could use gzip and sacrifice some size for compression speed, or try out other formats such as lzop or lzma. You might also tune the compression level: bzip2 defaults to
-9
(maximum block size, so maximum compression, but also longest compression time); set the environment variableBZIP2
to a value like-3
to try compression level 3. This thread and this thread discuss common compression algorithms; in particular this blog post cited by derobert gives some benchmarks which suggest thatgzip -9
orbzip2
with a low level might be a good compromise compared tobzip2 -9
. This other benchmark which also includes lzma (the algorithm of 7zip, so you might use7z
instead oftar --lzma
) suggests thatlzma
at a low level can reach the bzip2 compression ratio faster. Just about any choice other than bzip2 will improve decompression time. Keep in mind that the compression ratio depends on the data, and the compression speed depends on the version of the compression program, on how it was compiled, and on the CPU it's executed on.Another option if the bottleneck is the CPU and you have multiple cores is to parallelize the compression. There are two ways to do that. One that works with any compression algorithm is to compress the files separately (either individually or in a few groups) and use
parallel
to run the archiving/compression commands in parallel. This may reduce the compression ratio but increases the speed of retrieval of an individual file and works with any tool. The other approach is to use a parallel implementation of the compression tool; this thread lists several.