Tar Gzip – How to Efficiently Remove Files from Large .tgz

gzip tar

Assume I have a gzip-compressed tarball, compressedArchive.tgz (100+ files, totaling 5+ GB).

What would be the fastest way to remove all entries matching a given filename pattern, for example prefix*.jpg, and then store what remains in a gzipped tarball again?

It does not matter whether the old archive is replaced or a new one is created; whichever is fastest.

Best Answer

With GNU tar, you can do:

pigz -d < file.tgz |
  tar --delete --wildcards -f - '*/prefix*.jpg' |
  pigz > newfile.tgz

With bsdtar:

pigz -d < file.tgz |
  bsdtar -cf - --exclude='*/prefix*.jpg' @- |
  pigz > newfile.tgz

(pigz is a multi-threaded implementation of gzip. Only compression is parallelized; decompression still runs essentially on a single thread.)
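To check the result of either pipeline, one option is to count the members that still match. The following is a minimal sketch, not part of the original answer: `count_matches` is a hypothetical helper name, and the regular expression is an assumed equivalent of the `*/prefix*.jpg` pattern (where `*` may also match `/`).

```shell
# Hypothetical helper: print how many members of a .tgz match */prefix*.jpg.
# Usage: count_matches ARCHIVE
count_matches() {
    # tar -tzf lists member names, one per line (works with GNU tar and bsdtar).
    # grep -c prints the number of matching lines; it exits non-zero when that
    # number is 0, hence the "|| true".
    tar -tzf "$1" | grep -c '/prefix.*\.jpg$' || true
}
```

Running `count_matches file.tgz` before and `count_matches newfile.tgz` after should show the count dropping to 0.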

You could also overwrite the file in place, like this:

{ pigz -d < file.tgz |
    tar --delete --wildcards -f - '*/prefix*.jpg' |
    pigz &&
    perl -e 'truncate STDOUT, tell STDOUT'
} 1<> file.tgz

But that's quite risky, especially if the result ends up being less compressed than the original file (in which case, the second pigz may end up overwriting areas of the file which the first one has not read yet).
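If enough disk space is available for a second copy, a less risky way to replace the archive is to write to a temporary file and rename it afterwards. This is a sketch under stated assumptions, not the method from the answer above: `strip_from_tgz` is a hypothetical helper name, GNU tar is required for `--delete`, and gzip is used as a fallback when pigz is not installed.

```shell
# Hypothetical helper: rewrite ARCHIVE without the members matching PATTERN.
# Usage: strip_from_tgz ARCHIVE PATTERN
strip_from_tgz() {
    archive=$1
    pattern=$2
    # Prefer pigz; fall back to gzip (an assumption made for portability).
    if command -v pigz >/dev/null 2>&1; then gz=pigz; else gz=gzip; fi
    # Create the temporary file next to the archive so the final mv is a
    # rename on the same filesystem. Note that mktemp creates it with
    # mode 0600, so the replaced archive ends up with those permissions.
    tmp=$(mktemp "${archive}.XXXXXX") || return 1
    # Only the exit status of the last command (the compressor) is checked
    # here; a failure in tar alone would go unnoticed without pipefail.
    if "$gz" -d < "$archive" |
        tar --delete --wildcards -f - "$pattern" |
        "$gz" > "$tmp"
    then
        mv -- "$tmp" "$archive"
    else
        rm -f -- "$tmp"
        return 1
    fi
}
```

For example, `strip_from_tgz file.tgz '*/prefix*.jpg'` leaves the original untouched until the new archive has been written completely.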
