I have a large folder with 30M small files. I want to back up the folder into 30 archives, with each tar.gz file containing 1M files. The reason to split it into multiple archives is that untarring one single large archive would take a month. Piping tar to split also won't work, because then, to untar the files, I would have to cat all the archives back together first.
Also, I would rather not mv each file to a new dir, because even ls is very painful in this huge folder.
Best Answer
I wrote this bash script to do it. It basically forms an array containing the names of the files that go into each tar, then starts `tar` in parallel on all of them. It might not be the most efficient way, but it will get the job done. I expect it to consume a large amount of memory, though.

You will need to adjust the options at the start of the script. You might also want to change the tar options `cvjf` in the last line (for example, removing the verbose output `v` for performance, or changing the compression `j` to `z`, etc.).

Script
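A minimal sketch along these lines is below. The directory, glob pattern, tarball count, parallelism, and the /tmp output location are assumptions to adjust; it also assumes GNU xargs and file names without whitespace.

```bash
#!/bin/bash

# --- Knobs to adjust (values here are assumptions for a small test) ---
source_dir=/tmp/testdir     # directory holding the files to archive
pattern="file*"             # glob selecting the files to back up
num_tars=4                  # number of tarballs to produce
max_procs=4                 # maximum number of tar processes to run in parallel

cd "$source_dir" || exit 1

# Store every file name matching the pattern in the array "files"
# ($pattern is unquoted on purpose so the glob expands).
files=( $pattern )

# Slice "files" into $num_tars chunks and build one string per tarball,
# prefixed with the name of the archive that chunk should go into.
total=${#files[@]}
per_tar=$(( (total + num_tars - 1) / num_tars ))   # ceiling division
tar_files=()
for (( i = 0; i < num_tars; i++ )); do
    slice=( "${files[@]:i*per_tar:per_tar}" )
    (( ${#slice[@]} )) || continue
    tar_files+=( "/tmp/tar$i.tar.bz2 ${slice[*]}" )
done

# Feed one line per tarball to xargs, which runs up to $max_procs tar
# processes in parallel. The unquoted $1 deliberately word-splits each
# line back into "archive file file ..." (so this breaks on file names
# containing whitespace).
printf '%s\n' "${tar_files[@]}" |
    xargs -I{} -P "$max_procs" sh -c 'tar cvjf $1' _ {}
```

For the case in the question (30 gzip archives of roughly 1M files each), you would set `num_tars=30` and change `cvjf` to `cvzf`.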
Explanation
First, all the file names that match the selected pattern are stored in the array `files`. Next, the for loop slices this array and forms strings from the slices. The number of slices is equal to the number of desired tarballs. The resulting strings are stored in the array `tar_files`. The for loop also adds the name of the resulting tarball to the beginning of each string. The elements of `tar_files` take the following form (assuming 5 files/tarball):
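With the assumptions used in the sketch above (archives written to /tmp, hypothetical file names), each element would look roughly like this:

```
/tmp/tar0.tar.bz2 file01 file02 file03 file04 file05
/tmp/tar1.tar.bz2 file06 file07 file08 file09 file10
...
```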
The last line of the script uses `xargs` to start multiple `tar` processes (up to the specified maximum), each of which processes one element of the `tar_files` array in parallel.

Test
List of files:
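For a small test, assume a scratch directory populated with 20 hypothetical files (the directory and names below are only for illustration):

```
$ mkdir -p /tmp/testdir && cd /tmp/testdir
$ touch file{01..20}
$ ls
file01  file02  file03  file04  file05  file06  file07  file08  file09  file10
file11  file12  file13  file14  file15  file16  file17  file18  file19  file20
```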
Generated Tarballs:

```
$ ls /tmp/tar*
tar0.tar.bz2  tar1.tar.bz2  tar2.tar.bz2  tar3.tar.bz2
```