Context
I'm compressing ~1.3 GB folders each filled with 1440 JSON files and find that there's a 15-fold difference between using the tar
command and Python's built-in tarfile
library on macOS or Raspbian 10 (Buster)
Minimal working example
This script compares both methods:
#!/usr/bin/env python3
from pathlib import Path
from subprocess import call
import tarfile
fullpath = Path("/Users/user/Desktop/temp/tar/2021-03-11")
zsh_out = Path(fullpath.parent, "zsh-archive.tar.xz")
py_out = Path(fullpath.parent, "py-archive.tar.xz")
# tar using terminal
# tar cJf zsh-archive.tar.xz folderpath
call(["tar", "cJf", zsh_out, fullpath])
# tar using tarfile library
with tarfile.open(py_out, "w:xz") as tar:
tar.add(fullpath, arcname=fullpath.stem)
# Print filesizes
print(f"zsh tar filesize: {round(Path(zsh_out).stat().st_size/(1024*1024), 2)} MB")
print(f"py tar filesize: {round(Path(py_out).stat().st_size/(1024*1024), 2)} MB")
The output is:
zsh tar filesize: 23.7 MB
py tar filesize: 1.49 MB
The versions I use are as follows:
tar
on macOS:bsdtar 3.3.2 - libarchive 3.3.2 zlib/1.2.11 liblzma/5.0.5 bz2lib/1.0.6
tar
on Raspbian 10:xz (XZ Utils) 5.2.4 liblzma 5.2.4
tarfile
Python library:0.9.0
Things I've tried
After compression, I've extracted both archives and compared the resulting folder with:
diff -r py-archive-expanded zsh-archive-expanded
There was no difference.
If I compare the two tar archives directly, they seem different:
➜ diff zsh-archive.tar.xz py-archive.tar.xz
Binary files zsh-archive.tar.xz and py-archive.tar.xz differ
If I inspect the archives with Quicklook (and the Betterzip plugin) I see that the files in the archive are ordered in a different way:
Left is zsh-archive.tar.xz
, right is py-archive.tar.xz
:
The zsh archive uses an unknown order, and the Python archive orders the file by modification date. I am not sure if that matters.
Question
What is going on? Am I losing something by using the Python library to compress my data? Is the 15-fold difference in size an indicator of some issue? Or can I safely go ahead and use the efficient Python implementation?
Best Answer
Short answer: yes, it is safe to use Python
tarlib
to compress the data, nothing is lost compared to BSDtar
.Underlying issue: sorting
I think the underlying issue is that BSD
tar
and GNUtar
without any sort options put the files in the archive in an undefined order.GNU
tar
has a--sort
option:Testing GNU
tar
To test this I installed GNU
tar
on my Mac with:And then tarred the same folder, but with the
--sort
option:The
zsh-archive-sorted.tar.xz
archive is 1.5 MB, equal to the size of the archive created by the Python library.Concatenating in sorted order
The effect sorting has on the final archive size is further demonstrated by first concatenating all the JSON files sorted by name (which has the creation unixtime at the beginning of it) and then tarring with BSD
tar
:The
zsh-cat-archive.tar.xz
archive is also 1.5 MB.Python
tarfile
sortingFinally, the documentation of the Python
TarFile.add
function confirms that Pythontarfile
sorts by default:Why sorting matters
I think the reason sorting has such an impact in my case is as follows:
My JSON files contain locations of hundreds of vehicles. Every minute I read out all the locations, but only a few of these locations have a different value from minute to minute.
By sorting the files by name, two subsequent files have little different characters between them. Apparently this is very favourable for the compression efficiency.