macOS – Why are tar.xz files 15x smaller when using Python's tarfile library compared to macOS tar

Tags: compression, macos, python, tar, zsh

Context

I'm compressing ~1.3 GB folders, each containing 1440 JSON files, and I find a 15-fold difference in archive size between the tar command and Python's built-in tarfile library, on both macOS and Raspbian 10 (Buster).

Minimal working example

This script compares both methods:

#!/usr/bin/env python3

from pathlib import Path
from subprocess import call
import tarfile

fullpath = Path("/Users/user/Desktop/temp/tar/2021-03-11")
zsh_out = Path(fullpath.parent, "zsh-archive.tar.xz")
py_out = Path(fullpath.parent, "py-archive.tar.xz")

# tar using terminal
# tar cJf zsh-archive.tar.xz folderpath
call(["tar", "cJf", zsh_out, fullpath])

# tar using tarfile library
with tarfile.open(py_out, "w:xz") as tar:
    tar.add(fullpath, arcname=fullpath.name)  # .name keeps the full folder name even if it contains a dot

# Print filesizes
print(f"zsh tar filesize: {round(Path(zsh_out).stat().st_size/(1024*1024), 2)} MB")
print(f"py tar filesize: {round(Path(py_out).stat().st_size/(1024*1024), 2)} MB")

The output is:

zsh tar filesize: 23.7 MB
py tar filesize: 1.49 MB

The versions I use are as follows:

tar on macOS: bsdtar 3.3.2 - libarchive 3.3.2 zlib/1.2.11 liblzma/5.0.5 bz2lib/1.0.6
tar on Raspbian 10: xz (XZ Utils) 5.2.4 liblzma 5.2.4
tarfile Python library: 0.9.0

Things I've tried

After compression, I've extracted both archives and compared the resulting folder with:

diff -r py-archive-expanded zsh-archive-expanded

There was no difference.

If I compare the two tar archives directly, they seem different:

➜ diff zsh-archive.tar.xz py-archive.tar.xz
Binary files zsh-archive.tar.xz and py-archive.tar.xz differ

If I inspect the archives with Quicklook (and the Betterzip plugin) I see that the files in the archive are ordered in a different way:

[Screenshot: zsh-archive.tar.xz on the left, py-archive.tar.xz on the right]

The zsh archive stores the files in an unknown order, while the Python archive orders the files by modification date. I am not sure whether that matters.

Question

What is going on? Am I losing something by using the Python library to compress my data? Is the 15-fold difference in size an indicator of some issue? Or can I safely go ahead and use the efficient Python implementation?

Best Answer

Short answer: yes, it is safe to use Python's tarfile library to compress the data. Nothing is lost compared to BSD tar.

Underlying issue: sorting

I think the underlying issue is that BSD tar and GNU tar without any sort options put the files in the archive in an undefined order.

GNU tar has a --sort option:

sort directory entries according to ORDER, which is one of none, name, or inode.
The default is --sort=none, which stores archive members in the same order as returned by the operating system.

Testing GNU `tar`

To test this I installed GNU tar on my Mac with:

brew install gnu-tar

And then tarred the same folder, but with the --sort option:

gtar --sort='name' -cJf zsh-archive-sorted.tar.xz /Users/user/Desktop/temp/tar/2021-03-11

The zsh-archive-sorted.tar.xz archive is 1.5 MB, equal to the size of the archive created by the Python library.

Concatenating in sorted order

The effect of sorting on the final archive size can be demonstrated further by first concatenating all the JSON files in name order (each file name begins with its creation Unix time, and the shell expands *.json in sorted order) and then compressing the result with BSD tar:

cat *.json > all.txt
tar cJf zsh-cat-archive.tar.xz all.txt

The zsh-cat-archive.tar.xz archive is also 1.5 MB.

Python `tarfile` sorting

Finally, the documentation of the Python TarFile.add function confirms that Python tarfile sorts by default:

Directories are added recursively by default. This can be avoided by setting recursive to False. Recursion adds entries in sorted order.
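A quick way to check this ordering yourself is sketched below. It assumes Python 3.7 or later (the version in which the documentation says sorted recursion was introduced); the folder and file names are made up for the illustration.

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical folder; files are created in a deliberately unsorted order.
    folder = Path(tmp, "data")
    folder.mkdir()
    for name in ["c.json", "a.json", "b.json"]:
        (folder / name).write_text("{}")

    archive = Path(tmp, "data.tar.xz")
    with tarfile.open(archive, "w:xz") as tar:
        tar.add(folder, arcname=folder.name)  # recursive add

    # Read the member names back in the order they are stored.
    with tarfile.open(archive, "r:xz") as tar:
        names = tar.getnames()

print(names)
```

The member names come back in name order regardless of the order in which the files were created on disk.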

Why sorting matters

I think the reason sorting has such an impact in my case is as follows:

My JSON files contain the locations of hundreds of vehicles. I read out all the locations every minute, but only a few of them change from one minute to the next.
When the files are sorted by name, and therefore by time, consecutive files are almost identical. The compressor can then encode most of each file as a reference to data it has just seen, which is very favourable for the compression ratio.
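The effect can be reproduced on a small scale with synthetic data. The sketch below is only an illustration under stated assumptions: the snapshot generator, the file names and the deliberately small 64 KiB dictionary (chosen so that the test data is much larger than the compressor's window, as 1.3 GB is for xz's defaults) are all made up for the demo and are not taken from the original setup.

```python
import io
import json
import lzma
import random
import tarfile


def make_snapshots(n_files=200, n_vehicles=500, changes_per_step=5, seed=0):
    """Fake per-minute snapshots in which only a few positions change each step."""
    rng = random.Random(seed)

    def new_position():
        return [round(rng.uniform(50, 54), 5), round(rng.uniform(3, 7), 5)]

    positions = [new_position() for _ in range(n_vehicles)]
    snapshots = {}
    for minute in range(n_files):
        for idx in rng.sample(range(n_vehicles), changes_per_step):
            positions[idx] = new_position()
        name = f"{1615420800 + 60 * minute}.json"  # hypothetical timestamped name
        snapshots[name] = json.dumps({"vehicles": positions}).encode()
    return snapshots


def xz_tar_size(snapshots, order, dict_size=1 << 16):
    """Build an uncompressed tar in memory in the given order, then xz it."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in order:
            info = tarfile.TarInfo(name)
            info.size = len(snapshots[name])
            tar.addfile(info, io.BytesIO(snapshots[name]))
    # Small dictionary so the compressor can only "see" a handful of recent files.
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return len(lzma.compress(buf.getvalue(), format=lzma.FORMAT_XZ, filters=filters))


snapshots = make_snapshots()
by_name = sorted(snapshots)
shuffled = by_name[:]
random.Random(1).shuffle(shuffled)

sorted_size = xz_tar_size(snapshots, by_name)
shuffled_size = xz_tar_size(snapshots, shuffled)
print(f"sorted by name: {sorted_size} bytes")
print(f"shuffled:       {shuffled_size} bytes")
```

With the files in name order, each snapshot sits right next to its near-identical predecessor, so the sorted archive comes out smaller than the shuffled one. The exact ratio depends on the synthetic data and will not match the 15-fold figure above.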

Format Details

A tar file is the concatenation of one or more files. Each file is preceded by a 512-byte header record. The file data is written unaltered except that its length is rounded up to a multiple of 512 bytes and the extra space is zero filled. The end of an archive is marked by at least two consecutive zero-filled records.
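The layout described above can be inspected with a short sketch. It assumes the classic ustar header layout (name in the first 100 bytes, size as an octal string at offset 124) and forces that format explicitly; the member name hello.txt is just an example.

```python
import io
import tarfile

payload = b"hello"
buf = io.BytesIO()
# Force the ustar format so the header layout is predictable.
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    info = tarfile.TarInfo("hello.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
raw = buf.getvalue()

header, body = raw[:512], raw[512:1024]
name = header[:100].rstrip(b"\0").decode("ascii")  # NUL-padded member name
size = int(header[124:135].decode("ascii"), 8)     # size field is octal text

print(name, size)
print(body[:size])             # file data, stored unaltered
print(len(raw) % 512 == 0)     # archive length is a multiple of 512 bytes
```

Everything after the first data block is zero fill: the padding of the 5-byte file up to 512 bytes, followed by the end-of-archive records.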

gzip compresses a single file into another single file, but does not create archives.

File Format

...Although its file format also allows for multiple such streams to be concatenated (gzipped files are simply decompressed and concatenated as if they were originally one file), gzip is normally used to compress just single files. Compressed archives are typically created by assembling collections of files into a single tar archive, and then compressing that archive with gzip.

Linux – How to specify level of compression when using tar -zcvf

GZIP=-9 tar cvzf file.tar.gz /path/to/directory

This assumes you're using bash. In general, set the GZIP environment variable to "-9" and run tar normally.

Also, if you really want the best compression, don't use gzip. Use lzma or 7z.

And when you do use gzip (which is a good idea for various reasons anyway), consider using the pigz program instead of gzip.
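For readers doing the same from Python rather than the shell, tarfile.open accepts a compresslevel keyword for the "w:gz" mode and a preset keyword for "w:xz". The following is a minimal sketch; the pack helper, the member name and the sample payload are made up for illustration.

```python
import io
import tarfile


def pack(payload: bytes, mode: str, **options) -> bytes:
    """Write one in-memory member into a compressed tar and return the bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode, **options) as tar:
        info = tarfile.TarInfo("data.json")  # hypothetical member name
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


payload = b'{"vehicle": 1, "lat": 52.1, "lon": 4.3}\n' * 5000

gz_fast = pack(payload, "w:gz", compresslevel=1)  # like gzip -1
gz_best = pack(payload, "w:gz", compresslevel=9)  # like gzip -9
xz_best = pack(payload, "w:xz", preset=9)         # like xz -9

for label, blob in [("gz -1", gz_fast), ("gz -9", gz_best), ("xz -9", xz_best)]:
    print(f"{label}: {len(blob)} bytes")
```

How much the higher levels gain depends heavily on the data, so it is worth measuring on a representative sample before settling on one.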