When you send the same set of files again, rsync is better suited because it will only send the differences. tar will always send everything, and this is a waste of resources when a lot of the data is already there. The tar + rsync + untar approach loses this advantage in that case, as well as the advantage of keeping the folders in sync with rsync --delete.
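For that sync use case, a minimal invocation could look like the following (paths and host are placeholders):
rsync -az --delete /src/dir/ user@server:/dest/dir/
Here -a preserves permissions, times and symlinks, -z compresses the data on the wire, and --delete removes files from the destination that no longer exist in the source.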
If you copy the files for the first time, first packing, then sending, then unpacking (AFAIK rsync doesn't take piped input) is cumbersome and always worse than just rsyncing, because rsync won't have to do any more work than tar anyway.
Tip: rsync version 3 or later does incremental recursion, meaning it starts copying almost immediately, before it has counted all the files.
Tip 2: If you use rsync over ssh, you may also use either tar+ssh
tar -C /src/dir -jcf - ./ | ssh user@server 'tar -C /dest/dir -jxf -'
or just scp
scp -Cr srcdir user@server:destdir
General rule: keep it simple.
UPDATE:
I've created 59M of demo data
mkdir tmp; cd tmp
for i in {1..5000}; do dd if=/dev/urandom of=file$i count=1 bs=10k; done
and tested the file transfer to a remote server (not on the same LAN) several times, using both methods
time rsync -r tmp server:tmp2
real 0m11.520s
user 0m0.940s
sys 0m0.472s
time (tar cf demo.tar tmp; rsync demo.tar server: ; ssh server 'tar xf demo.tar; rm demo.tar'; rm demo.tar)
real 0m15.026s
user 0m0.944s
sys 0m0.700s
while keeping separate logs of the ssh traffic packets sent
wc -l rsync.log rsync+tar.log
36730 rsync.log
37962 rsync+tar.log
74692 total
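The logs themselves could be collected in more than one way; one possible sketch, assuming tcpdump is available (interface and host names are placeholders), is to write one line per packet on the ssh port and count the lines afterwards:
sudo tcpdump -l -nn -i eth0 'tcp port 22 and host server' > rsync.log
wc -l then gives the packet count, since tcpdump prints one line per captured packet.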
In this case, I can't see any advantage in less network traffic by using rsync+tar, which is expected when the default MTU is 1500 and the files are 10k in size. rsync+tar generated more traffic, was slower by 2-3 seconds, and left behind two garbage files that had to be cleaned up.
I did the same tests on two machines on the same LAN, and there rsync+tar achieved much better times and much, much less network traffic. I assume that is because of jumbo frames.
Maybe rsync+tar would be better than plain rsync on a much larger data set. But frankly I don't think it's worth the trouble: you need double the space on each side for packing and unpacking, and there are a couple of other options, as I've already mentioned above.
It looks like this was a bug in mke2fs that caused it to use fallocate(fd, PUNCH_HOLE, ...) instead of fallocate(fd, DISCARD_ZERO, ...) when zeroing out the space in the inode tables (even when -E nodiscard was used).
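To see the practical difference between punching a hole and zeroing a range, here is a minimal sketch using the util-linux fallocate(1) utility on a throwaway file (sizes are arbitrary; both modes need filesystem support, e.g. ext4):
dd if=/dev/zero of=demo.img bs=64K count=16                # 1M fully allocated file
fallocate --punch-hole --offset 0 --length 64K demo.img    # deallocates the range; it reads back as zeros but the blocks are freed
fallocate --zero-range --offset 64K --length 64K demo.img  # zeroes the range while keeping the blocks allocated
du -h --apparent-size demo.img; du -h demo.img             # apparent size stays 1M, allocated size drops by the punched 64K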
I submitted a bug report to the upstream linux-ext4@vger.kernel.org mailing list after verifying this behaviour locally, and got a patch within an hour, subject:
e2fprogs: block zero/discard cleanups
They should be included in the e2fsprogs-1.45 release, and likely the 1.44.x maintenance release. If you want them in a vendor e2fsprogs release, I'd recommend patching and building your e2fsprogs to verify the fix works for you, reporting success to linux-ext4 so that the patches land sooner, and then submitting a bug report to your distro of choice so they pull the upstream patches into their releases.
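A rough sketch of that patch-and-verify step (the patch file name and the target device are placeholders; the git URL is the usual upstream e2fsprogs repository):
git clone https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
cd e2fsprogs
git am ~/block-zero-discard-cleanups.patch   # apply the posted patches
./configure && make
sudo ./misc/mke2fs -E nodiscard /dev/sdXN    # re-run the failing case with the freshly built mke2fs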
Best Answer
bsdtar (at least from libarchive 3.1.2) is able to detect sparse sections using the FS_IOC_FIEMAP ioctl on the file systems that support it (though it supports a number of other APIs as well); however, at least in my test, strangely enough, it is not able to handle the tar files it generates itself (looks like a bug though). Using GNU tar to extract them works, but then GNU tar can't handle some of the extended attributes that bsdtar supports. So creating the archive with bsdtar and extracting it with GNU tar works as long as the files don't have extended attributes or flags.
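A small round trip to try that combination (a sketch; file names and sizes are arbitrary, and whether the holes survive depends on the archive format and both tools' sparse support):
truncate -s 100M sparse.img                        # mostly-sparse test file
printf data | dd of=sparse.img bs=1 seek=4096 conv=notrunc
bsdtar -cf sparse.tar sparse.img                   # create with bsdtar
mkdir out && tar -C out -xf sparse.tar             # extract with GNU tar
du -h sparse.img out/sparse.img                    # compare allocated sizes to see whether the holes survived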
It still doesn't work for files that are fully sparse (only zeros), as the FS_IOC_FIEMAP ioctl then returns 0 extents and it looks like bsdtar doesn't handle that properly (another bug?).
star (Schily tar) is another open-source tar implementation that can detect sparse files (use the -sparse option) and doesn't have those bugs of bsdtar (but is not packaged by many systems).
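For reference, a possible star invocation (the f=archive syntax is my assumption from star's usual command style; only the -sparse option is taken from above):
star -c -sparse f=sparse.tar sparse.img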