How to compress files in-place

Tags: command-line, compression, disk-usage, gzip, tar

I have a machine whose hard disk is 90% full. I want to compress its 500+ log files. However, the disk is too small to hold both the original files and the compressed ones at the same time.

So what I need is to compress the log files one by one, each into a new compressed file, deleting each original as soon as it has been compressed.

How can I do that in Linux?

Best Answer

gzip or bzip2 will compress the file and remove the uncompressed original automatically (this is their default behaviour).

However, keep in mind that while a file is being compressed, both the original and the compressed version exist on disk at the same time.

If you want to compress log files (i.e. files containing text), you may prefer bzip2, since it usually achieves a better compression ratio on text.

bzip2 -9 myfile       # will produce myfile.bz2
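
To process all 500+ logs one at a time, a simple shell loop is enough. A minimal sketch, assuming the logs match *.log in the current directory (adjust the glob to your files):

for f in *.log; do
    bzip2 -9 -- "$f"    # the original is removed once it has been compressed
done

Since each original is deleted as soon as its compressed copy is written, you only ever need free space for one compressed file at a time.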

Comparison and examples:

$ ls -l myfile
-rw-rw-r-- 1 apaul apaul 585999 29 april 10:09 myfile

$ bzip2 -9 myfile

$ ls -l myfile*
-rw-rw-r-- 1 apaul apaul 115780 29 april 10:09 myfile.bz2

$ bunzip2 myfile.bz2

$ gzip -9 myfile

$ ls -l myfile*
-rw-rw-r-- 1 apaul apaul 146234 29 april 10:09 myfile.gz

UPDATE: as @Jjoao pointed out in a comment, interestingly, xz seems to give the best ratio on plain text files, even with its default options:

$ xz -9 myfile

$ ls -l myfile*
-rw-rw-r-- 1 apaul apaul 109384 29 april 10:09 myfile.xz

For more information, here is an interesting benchmark of different compression tools: http://binfalse.de/2011/04/04/comparison-of-compression/

In the examples above, I use -9 for the best compression ratio, but if the time needed to compress the data matters more than the ratio, you'd better use a lower level (e.g. -1, or something in between).
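
To measure the trade-off on your own data, you can time a few levels. A minimal sketch, assuming a sample file named myfile; -c writes the result to stdout and keeps the original, so the runs are independent of each other:

for level in 1 6 9; do
    time xz -"$level" -c myfile > "myfile.$level.xz"
done
ls -l myfile.*.xz    # compare the sizes against the times printed above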
