Cpio vs tar – what is the best archive solution to compress hundreds of directories into one file

Tags: compression, cpio, tar

I have hundreds of directories under /var/Recording, and each directory contains subdirectories as well as files, hard links and soft links.

I want to compress all directories under /var/Recording to create a single compressed file.

Which command would give me the best compression: tar or cpio (especially considering that I have hard and soft links)?

Also, what is the right syntax for the tar/cpio command?

  drwxr-x--- 2 root root 4096 Nov 22 18:57 458ca4e8-0edf-4204-9f9b-9c3dc02953c5.1034
  drwxr-x--- 2 root root 4096 Nov 22 18:57 458ca4e8-0edf-4204-9f9b-9c3dc02953c5.1033
  drwxr-x--- 2 root root 4096 Nov 22 18:57 458ca4e8-0edf-4204-9f9b-9c3dc02953c5.1038
  drwxr-x--- 2 root root 4096 Nov 22 18:57 458ca4e8-0edf-4204-9f9b-9c3dc02953c5.1037
  drwxr-x--- 2 root root 4096 Nov 22 18:57 458ca4e8-0edf-4204-9f9b-9c3dc02953c5.1036
  drwxr-x--- 2 root root 4096 Nov 22 18:57 458ca4e8-0edf-4204-9f9b-9c3dc02953c5.1041
  drwxr-x--- 2 root root 4096 Nov 22 18:57 458ca4e8-0edf-4204-9f9b-9c3dc02953c5.1040
  drwxr-x--- 2 root root 4096 Nov 22 18:57 458ca4e8-0edf-4204-9f9b-9c3dc02953c5.1039
  drwxr-x--- 2 root root 4096 Nov 22 18:57 458ca4e8-0edf-4204-9f9b-9c3dc02953c5.1044
  drwxr-x--- 2 root root 4096 Nov 22 18:57 458ca4e8-0edf-4204-9f9b-9c3dc02953c5.1043
  drwxr-x--- 2 root root 4096 Nov 22 18:57 458ca4e8-0edf-4204-9f9b-9c3dc02953c5.1042
  drwxr-x--- 2 root root 4096 Nov 22 18:57 458ca4e8-0edf-4204-9f9b-9c3dc02953c5.1047
  drwxr-x--- 2 root root 4096 Nov 22 18:57 458ca4e8-0edf-4204-9f9b-9c3dc02953c5.1046
  drwxr-x--- 2 root root 4096 Nov 22 18:57 458ca4e8-0edf-4204-9f9b-9c3dc02953c5.1045
  drwxr-x--- 2 root root 4096 Nov 22 18:57 458ca4e8-0edf-4204-9f9b-9c3dc02953c5.1049
  drwxr-x--- 2 root root 4096 Nov 22 18:57 458ca4e8-0edf-4204-9f9b-9c3dc02953c5.1048
  .
  .
  .
  .
  .

Best Answer

cpio (the older of the two utilities, counting from when they first shipped with UNIX) originally supported hard links only with the -p option (i.e. copying from filesystem to filesystem), but the newc output format (not the default format cpio uses) also preserves hard links in the output file. (GNU) tar supports hard links without any special options. A comparison can be found here.

So if you run a test with a large hard linked file and 100 small files:

$ mkdir tmp
$ dd if=/dev/urandom of=tmp/blabla bs=1k count=1024
1024+0 records in
1024+0 records out
1048576 bytes (1,0 MB) copied, 0,0764345 s, 13,7 MB/s
$ ln tmp/blabla tmp/hardlink
$ tar cvf out.tar tmp
$ find tmp -print0 | cpio -0o > out.cpio
4104 blocks
$ find tmp -print0 | cpio -0o --format=newc > outnewc.cpio
2074 blocks
$ xz -9k out.tar outnewc.cpio
$ bzip2 -9k out.tar outnewc.cpio
$ ls -l out*
-rw-rw-r-- 1 anthon users 2101248 Nov 23 12:30 out.cpio
-rw-rw-r-- 1 anthon users 1061888 Nov 23 12:30 outnewc.cpio
-rw-rw-r-- 1 anthon users 1055935 Nov 23 12:30 outnewc.cpio.bz2
-rw-rw-r-- 1 anthon users 1050652 Nov 23 12:30 outnewc.cpio.xz
-rw-rw-r-- 1 anthon users 1157120 Nov 23 12:30 out.tar
-rw-rw-r-- 1 anthon users 1055402 Nov 23 12:30 out.tar.bz2
-rw-rw-r-- 1 anthon users 1050928 Nov 23 12:30 out.tar.xz

You can see that the uncompressed versions (outnewc.cpio vs. out.tar) give cpio an advantage, and that compressing with xz -9 gives better results than bzip2 -9 (gzip is usually much worse than either). Compressing with xz also largely erases the difference between the tar and cpio output. Compression is, however, heavily dependent on the data, and also on the ordering of the data in the archives, so you should really test this on (a sample of) your real data.
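For the actual use case in the question, the tar route boils down to something like the following sketch (demo-rec stands in for /var/Recording here; -J selects xz compression and assumes a GNU tar built with xz support):

```shell
# Sketch: pack a directory tree into a single xz-compressed tar archive.
# "demo-rec" stands in for /var/Recording; substitute the real path in production.
set -e
mkdir -p demo-rec/dir1
echo data > demo-rec/dir1/file
ln demo-rec/dir1/file demo-rec/dir1/hardlink   # hard link
ln -s file demo-rec/dir1/symlink               # soft (symbolic) link
# c = create, J = compress through xz, f = archive file name
tar -cJf recordings.tar.xz demo-rec
tar -tJf recordings.tar.xz                     # list contents to verify
```

tar records the hard link and the symlink as link entries rather than storing the file data twice, so no extra flags are needed for them.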

If you want to compress in parallel, you might want to look at my article here.
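As a sketch of one parallel option: recent xz releases (assumed here to be xz ≥ 5.2) can multi-thread on their own via -T, where -T0 uses one thread per core. Threaded xz splits the input into independent blocks, so the output can be slightly larger than single-threaded -9:

```shell
# Sketch: parallel compression of a tar stream with multi-threaded xz.
# Assumes xz >= 5.2; -T0 autodetects the core count. Block splitting for
# threading can make the result slightly larger than plain single-threaded -9.
set -e
mkdir -p par-demo
echo hello > par-demo/file
tar -cf - par-demo | xz -9 -T0 > par-demo.tar.xz
xz -t par-demo.tar.xz                          # integrity check
```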
