Linux – Tar Splitting Into Standalone Volumes

archive backup linux tar

I have some directories with over 100 GB of data. I'm trying to archive them into smaller volumes, e.g. 10 GB each, that are independent / standalone.

The problem is if I use tar + split, it results in multiple tar parts that are not independent. I cannot just extract files from one of the parts, unless I cat / combine all back into a single large file first.

I've also tried using tar -c -L1000M ... to split volumes, but that doesn't work either and there's a problem with long filenames getting truncated.

I tried star as well, but its split volumes do not seem to be independent either, and 7zip does not preserve Unix permissions.

The reason I want independent split archives is safety: if one of the split files is corrupted, I can still retrieve data from the other archives. It is also much faster when I only want to extract specific files/folders, since I don't need to combine all the archives back into a single large volume first.

How best do I achieve this? Thank you.


SOLUTION FOUND

I have found a solution using tar, as suggested by @Haxiel's answer. The answer has been posted below.

Note that a file or two may still be lost if it crosses the boundary of a volume and you don't have the next volume available, but at least the separate volumes can be extracted independently even if the other parts are missing.

Best Answer

This is not a perfect solution, but GNU tar's multi-volume archives seem to be close to what you're looking for. This option is already mentioned in your question, but I would like to add a reference from the GNU tar manual that clarifies why this is a possible option:

Multi-volume archive is a single tar archive, stored on several media volumes of fixed size. Although in this section we will often call `volume' a tape, there is absolutely no requirement for multi-volume archives to be stored on tapes. Instead, they can use whatever media type the user finds convenient, they can even be located on files.

When creating a multi-volume archive, GNU tar continues to fill current volume until it runs out of space, then it switches to next volume (usually the operator is queried to replace the tape on this point), and continues working on the new volume. This operation continues until all requested files are dumped. If GNU tar detects end of media while dumping a file, such a file is archived in split form. Some very big files can even be split across several volumes.

Each volume is itself a valid GNU tar archive, so it can be read without any special options. Consequently any file member residing entirely on one volume can be extracted or otherwise operated upon without needing the other volume. Sure enough, to extract a split member you would need all volumes its parts reside on.

Multi-volume archives suffer from several limitations. In particular, they cannot be compressed.

With this definition, the only files that would be a problem are the ones that are split across the size boundary. Files that are fully contained within a single volume could be treated as independent of the other volumes.
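As a rough sketch of how such volumes could be created without answering a prompt for every volume, GNU tar can call a new-volume script (-F) that supplies the next archive name. Everything named here is an assumption for the demo: the data directory, the backup-N.tar naming, the next-volume.sh helper and the tiny 500 KiB volume size (for the case in the question, a value such as -L 10G would be the obvious substitute). It also assumes GNU tar and bash are installed.

```shell
#!/bin/sh
# Sketch: create a multi-volume archive whose volumes are named automatically.
# Assumes GNU tar and bash; all names and sizes below are made up for the demo.
set -eu

work=$(mktemp -d)
cd "$work"

# Hypothetical sample data: three files of 400 KiB each.
mkdir data
for i in 1 2 3; do
    head -c 409600 /dev/urandom > "data/file$i"
done

# New-volume script. GNU tar runs it each time a volume is full and sets
# TAR_VOLUME (number of the volume about to start) and TAR_FD (a file
# descriptor on which the script may write the next archive name).
cat > next-volume.sh <<'EOF'
#!/bin/bash
printf '%s\n' "backup-$TAR_VOLUME.tar" >&"$TAR_FD"
EOF
chmod +x next-volume.sh

# -M: multi-volume, -L 500: 500 x 1024 bytes per volume, -F: new-volume script.
tar -c -M -L 500 -F "$work/next-volume.sh" -f backup-1.tar data

# The demo data does not fit into two such volumes, so three are written.
ls -l backup-*.tar
```

The script is written for bash rather than plain sh because the descriptor number in TAR_FD is not guaranteed to be a single digit, which some minimal shells cannot redirect to.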

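For each volume, it is possible to identify the split files by listing it with the -v option. A member that is continued from the previous volume is shown with the type letter M and a "Continued at byte" note: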

$ tar -tf multi-test2.tar -v
M--------- 0/0          658432 1970-01-01 03:00 file1--Continued at byte 7341568--
-rw-r--r-- test/users 4000000 2019-03-06 12:12 file2

The files that are fully contained can be extracted as you would with a single archive. tar seems to complain about the split file being incomplete, but it is able to extract the complete files without any problems.
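To illustrate, here is a small self-contained sketch that builds a two-volume archive and then extracts each volume on its own, without -M. The file names, the 1000 KiB volume size and the out1/out2 directories are made up for the demo, and it assumes GNU tar. The `|| true` is there because tar is expected to exit with an error on the split member in both cases.

```shell
#!/bin/sh
# Sketch: recover the complete members of each volume independently.
# Assumes GNU tar; all names and sizes are hypothetical demo values.
set -eu

work=$(mktemp -d)
cd "$work"

mkdir data out1 out2
for i in 1 2 3 4; do
    head -c 409600 /dev/urandom > "data/file$i"
done

# Two volumes of 1000 x 1024 bytes each. Naming the members explicitly
# fixes their order: file1 and file2 fit in vol1.tar, file3 is split
# across the boundary, file4 lands entirely in vol2.tar.
tar -c -M -L 1000 -f vol1.tar -f vol2.tar \
    data/file1 data/file2 data/file3 data/file4

# Extract each volume alone. tar reports an error for the split file3,
# so its non-zero exit status is ignored here; vol1.tar may also leave a
# truncated data/file3 behind.
tar -x -f vol1.tar -C out1 || true
tar -x -f vol2.tar -C out2 || true

# The members that sit entirely inside one volume come back intact.
cmp data/file1 out1/data/file1 && echo "file1 recovered from vol1.tar alone"
cmp data/file2 out1/data/file2 && echo "file2 recovered from vol1.tar alone"
cmp data/file4 out2/data/file4 && echo "file4 recovered from vol2.tar alone"
```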

The split files can also be extracted as a single unit from multiple volumes by using the -M option, which will prompt you to provide the name of the next volume. The usage is documented here. Alternatively, if you prefer to concatenate the volumes into a single archive, you can consider the tarcat utility.
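If all volumes are at hand, one way to avoid the interactive prompt is to give every volume its own -f option, in order; as far as I can tell, GNU tar then moves on to the next name by itself until the list runs out. A minimal sketch, again with made-up file names and a small volume size, assuming GNU tar:

```shell
#!/bin/sh
# Sketch: restore a member that is split across two volumes using -M.
# Assumes GNU tar; all names and sizes are hypothetical demo values.
set -eu

work=$(mktemp -d)
cd "$work"

mkdir data out
for i in 1 2 3; do
    head -c 409600 /dev/urandom > "data/file$i"
done

# file3 does not fit into the first 1000 x 1024 byte volume and is split.
tar -c -M -L 1000 -f vol1.tar -f vol2.tar data/file1 data/file2 data/file3

# Extract with -M and list the volumes in order. Absolute archive paths
# are used so that they resolve no matter which directory tar works in.
(cd out && tar -x -M -f "$work/vol1.tar" -f "$work/vol2.tar")

cmp data/file3 out/data/file3 && echo "split member file3 restored intact"
```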
