Generating sets of files that fit on a given media size for tar -T

disk-usage scripting tar

Can anyone suggest a script that will take as input the name of one or more directories and a media size, and output lists of files for input to tar using -T (assuming no compression)?

scdbackup/sdvdbackup sort of does this, but it's full of bloat that I don't need. So basically looking for something like this:

./splitTars file1 file2 .... 2.0T

where file can be a file or directory, and the last argument is the size of the media (e.g. 2TB). It should then output a file list for each tar archive and give a warning for files that are too big to fit on the media.

If nothing like this exists, one way to do it would be to create the list of files using find, re-arrange them in increasing or decreasing size, then start cutting the list up into pieces.

Best Answer

I think you have the knapsack problem for file sizes.

In other words: given a set of files, find the best grouping of files that fits on backup media of a fixed size (our knapsack). You then use each group to create a tar archive and transfer the archives one at a time from your local machine using whatever hard drive or flash drive you have.

I am outlining a sample solution in Python; adapt it to your own programming skills.

1. Take all the input arguments in a Python script and find the size of each file or directory. You can call du -sm on each argument so that Python doesn't have to do the hard work of summing directory sizes itself.
2. Eliminate those over your media limit right away.
3. Add the rest to a list and apply the knapsack algorithm to the entire set. Plenty of examples are around, like Mike's solution here.
4. Remove the items that went into the knapsack in that round, and return to step 3 with the leftover list.
5. Repeat steps 3 and 4 until all the remaining items fit into one knapsack (i.e. the total size of the leftover items is less than the media limit).

That should be it!
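
As a rough illustration of the overall flow, here is a minimal Python sketch. It is not the knapsack approach described above: it swaps in first-fit decreasing, a simpler bin-packing heuristic, and it works per file (as the question asks) rather than per du argument. The script name splittars.py, the list001.txt output names, the binary size units, and the decision to ignore tar's per-file header and padding overhead are all assumptions made for the example.

```python
import os
import sys

# Assumption: suffixes are binary units (1T = 1024**4 bytes).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def parse_size(text):
    """Turn a string such as '2.0T' or '700M' into a byte count."""
    text = text.strip().upper().rstrip("B")
    if text and text[-1] in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1]])
    return int(float(text))

def iter_files(paths):
    """Yield (path, size) for every file under the given files/directories."""
    for root in paths:
        if os.path.isdir(root) and not os.path.islink(root):
            for dirpath, _dirs, names in os.walk(root):
                for name in names:
                    full = os.path.join(dirpath, name)
                    yield full, os.lstat(full).st_size
        else:
            yield root, os.lstat(root).st_size

def pack(files, capacity):
    """First-fit decreasing. Returns (list of file lists, oversized files)."""
    bins, free, too_big = [], [], []
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        if size > capacity:
            too_big.append(path)       # cannot fit on any single medium
            continue
        for i, room in enumerate(free):
            if size <= room:           # first existing list with enough room
                bins[i].append(path)
                free[i] -= size
                break
        else:                          # nothing had room: start a new list
            bins.append([path])
            free.append(capacity - size)
    return bins, too_big

def main(argv):
    *paths, media = argv
    bins, too_big = pack(iter_files(paths), parse_size(media))
    for path in too_big:
        print(f"warning: {path} does not fit on the media", file=sys.stderr)
    for n, members in enumerate(bins, 1):
        # One file list per archive, one path per line, for tar -T.
        with open(f"list{n:03d}.txt", "w") as out:
            out.write("\n".join(members) + "\n")

if __name__ == "__main__" and len(sys.argv) >= 3:
    main(sys.argv[1:])
```

Under those assumptions it would be run as python3 splittars.py dir1 dir2 2.0T, and each resulting list could then be passed to tar with something like tar -cf disk1.tar -T list001.txt. Because tar adds a header and padding for every member, leaving some headroom below the real media size is probably wise.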

Related Solutions

Tar Command – Create Archive Excluding Hidden Files

You posted in a comment that you are working on a Mac OS X system. This is an important clue to the purpose of these ._* files.

These ._* archive entries are chunks of AppleDouble data that contain the extra information associated with the corresponding file (the one without the ._ prefix). They are generated by the Mac OS X–specific copyfile(3) family of functions. The AppleDouble blobs store access control data (ACLs) and extended attributes (commonly, Finder flags and “resource forks”, but xattrs can be used to store any kind of data).

The system-supplied Mac OS X archive tools (bsdtar (also symlinked as tar), gnutar, and pax) will generate a ._* archive member for any file that has any extended information associated with it; in “unarchive” mode, they will also decode those archive members and apply the resulting extended information to the associated file. This creates a “full fidelity” archive for use on Mac OS X systems by preserving and later extracting all the information that the HFS+ filesystem can store.

The corresponding archive tools on other systems do not know to give special handling to these ._* files, so they are unpacked as normal files. Since such files are fairly useless on other systems, they are often seen as “junk files”. Correspondingly, if a non–Mac OS X system generates an archive that includes normal files that start with ._, the Mac OS X unarchiving tools will try to decode those files as extended information.

There is, however, an undocumented(?) way to make the system-supplied Mac OS X archivers behave like they do on other Unixy systems: the COPYFILE_DISABLE environment variable. Setting this variable (to any value, even the empty string) will prevent the archivers from generating ._* archive members to represent any extended information associated with the archived files. Its presence will also prevent the archivers from trying to interpret such archive members as extended information.

COPYFILE_DISABLE=1 tar czf new.tar.gz …
COPYFILE_DISABLE=1 tar xzf unixy.tar.gz …

You might set this variable in your shell’s initialization file if you want to work this way more often than not.

# disable special creation/extraction of ._* files by tar, etc. on Mac OS X
COPYFILE_DISABLE=1; export COPYFILE_DISABLE

Then, when you need to re-enable the feature (to preserve/restore the extended information), you can “unset” the variable for individual commands:

(unset COPYFILE_DISABLE; tar czf new-osx.tar.gz …)

The archivers on Mac OS X 10.4 also do something similar, though they use a different environment variable: COPY_EXTENDED_ATTRIBUTES_DISABLE
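
If tar is being driven from a script rather than typed at a shell, the same idea can be applied by adjusting the environment handed to the child process. The following is only a sketch: the helper names tar_env and create_archive are made up for illustration, and it simply sets (or removes) both variables mentioned above so that it covers either OS version.

```python
import os
import subprocess

# The two variables named in this answer; which one matters depends on the OS version.
DISABLE_VARS = ("COPYFILE_DISABLE", "COPY_EXTENDED_ATTRIBUTES_DISABLE")

def tar_env(preserve_extended=False, base=None):
    """Return a copy of the environment with the ._* handling switched off or on."""
    env = dict(os.environ if base is None else base)
    for name in DISABLE_VARS:
        if preserve_extended:
            env.pop(name, None)   # same effect as "unset" in the shell
        else:
            env[name] = "1"       # any value works, even an empty string
    return env

def create_archive(archive, paths, preserve_extended=False):
    """Run the system tar with the adjusted environment (hypothetical helper)."""
    subprocess.run(
        ["tar", "czf", archive, *paths],
        check=True,
        env=tar_env(preserve_extended),
    )
```

Only the child process sees the change, so the calling script's own environment is left untouched.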

Tar – How to List Files Only with Tar

I don't see a way to do it from the man page, but you can always filter the results. The following assumes no newlines in your file names:

tar tzf your_archive | awk -F/ '{ if($NF != "") print $NF }'

How it works

By setting the field separator to /, the last field awk knows about ($NF) is either the file name if it's processing a file name or empty if it's processing a directory name (tar adds a trailing slash to directory names). So, we're basically telling awk to print the last field if it's not empty.
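
For comparison, roughly the same filtering can be sketched with Python's standard tarfile module. Instead of relying on the trailing slash, it asks each archive member whether it is a directory, so file names containing newlines are not a problem. The function name list_file_names is an assumption made for this example.

```python
import posixpath
import tarfile

def list_file_names(archive):
    """Return the last path component of every non-directory archive member."""
    names = []
    # "r:*" lets tarfile detect the compression (gzip, bzip2, xz, or none).
    with tarfile.open(archive, "r:*") as tf:
        for member in tf:
            if not member.isdir():
                names.append(posixpath.basename(member.name))
    return names

if __name__ == "__main__":
    import sys
    for archive in sys.argv[1:]:
        for name in list_file_names(archive):
            print(name)
```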
