How to display the non-sparse parts of a sparse file

sparse-files text-processing

Imagine a file created with:

truncate -s1T file
echo test >> file
truncate -s2T file

I now have a 2 tebibyte file (that occupies 4kiB on disk), with "test\n" written in the middle.

How would I recover that "test" efficiently, that is, without having to read the whole file?

tr -d '\0' < file

Would give me the result but that would take hours.

What I'd like is something that outputs only the non-sparse parts of the file (so above only "test\n" or more likely, the 4kiB block allocated on disk that stores that data).

There are APIs to find out which parts of the file are allocated (FIBMAP, FIEMAP, SEEK_HOLE, SEEK_DATA…), but what tools expose those?

A portable solution (at least to the OSes that support those APIs) would be appreciated.

Best Answer

The best I could come up with so far is the following (ksh93, using filefrag from e2fsprogs 1.42.9 (some older versions have a different API), on extent-based file systems on Linux):

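#! /bin/ksh93
export LC_ALL=C
for file do
  # -b1: report offsets and lengths in bytes. With IFS=": .", a[1] is the
  # extent's logical start offset and a[7] its length.
  filefrag -vb1 -- "$file" |
    while IFS=": ." read -A a; do
      # Keep extent lines only (first field numeric), skipping unwritten
      # ones, and copy a[7] bytes starting at offset a[1] with ksh93's head.
      [[ $a = +([0-9]) ]] && [[ ${a[@]} != *unwritten* ]] &&
        command /opt/ast/bin/head -s "${a[1]}" -c "${a[7]}" -- "$file"
    done
done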

filefrag reports the extents of the file using the FIEMAP ioctl for the filesystems that support it.

The *unwritten* test skips extents that have been preallocated with fallocate but never written to: they are not sparse, but they read as all zeros, which I'm not interested in.

Recent versions of bsdtar or star can use some of those APIs to generate a tar file that identifies the sparse sections as such. That would make for a more portable solution, but then one would have to parse the generated tar file to get the non-sparse sections.
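For the SEEK_DATA/SEEK_HOLE route mentioned in the question, here is a minimal Python sketch, assuming Python 3.3+ on an OS and file system that implement those lseek() whence values. The names data_extents and copy_data are made up for this illustration and are not part of any existing tool. Where the file system does not track holes, the whole file is typically reported as a single data region, and whether preallocated-but-unwritten extents count as data depends on the file system.

```python
import errno
import os


def data_extents(fd):
    """Yield (offset, length) for each data region of the open file fd."""
    size = os.fstat(fd).st_size
    offset = 0
    while offset < size:
        try:
            # Jump to the next offset >= offset that holds data.
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:  # only a hole remains up to EOF
                return
            raise
        # The end of the file counts as a hole, so this always succeeds.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        yield start, end - start
        offset = end


def copy_data(path, out, chunk=1 << 20):
    """Write the data regions of path, back to back, to the binary stream out."""
    fd = os.open(path, os.O_RDONLY)
    try:
        for start, length in data_extents(fd):
            pos, stop = start, start + length
            while pos < stop:
                buf = os.pread(fd, min(chunk, stop - pos), pos)
                if not buf:  # file shrank while we were reading
                    break
                out.write(buf)
                pos += len(buf)
    finally:
        os.close(fd)
```

Calling copy_data("file", sys.stdout.buffer) (after import sys) on the file from the question should print the allocated block containing "test\n" without reading the surrounding holes. Piping that through tr -d '\0' then only has a few kilobytes to process.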

Edit 2015

As of util-linux 2.25, the fallocate utility on Linux has a -d/--dig-holes option for that.

fallocate -d the-file

That would dig a hole for every block full of zeros in the file.

On older systems, you can do it by hand:

Linux has a FALLOC_FL_PUNCH_HOLE option to fallocate that can do this. I found a script on github with an example:

Using FALLOC_FL_PUNCH_HOLE from Python

I modified it a bit to do what you asked -- punch holes in regions of files that are filled with zeros. Here it is:

Using FALLOC_FL_PUNCH_HOLE from Python to punch holes in files

usage: punch.py [-h] [-v VERBOSE] FILE [FILE ...]

Punch out the empty areas in a file, making it sparse

positional arguments:
  FILE                  file(s) to modify in-place

optional arguments:
  -h, --help            show this help message and exit
  -v VERBOSE, --verbose VERBOSE
                        be verbose

Example:

# create a file with some data, a hole, and some more data
$ dd if=/dev/urandom of=test1 bs=4096 count=1 seek=0
$ dd if=/dev/urandom of=test1 bs=4096 count=1 seek=2

# see that it has holes
$ du --block-size=1 --apparent-size test1
12288   test1
$ du --block-size=1 test1
8192    test1

# copy it, ignoring the hole
$ cat test1 > test2
$ du --block-size=1 --apparent-size test2
12288   test2
$ du --block-size=1 test2
12288    test2

# punch holes again
$ ./punch.py test2
$ du --block-size=1 --apparent-size test2
12288   test2
$ du --block-size=1 test2
8192    test2

# verify
$ cmp test1 test2 && echo "files are the same"
files are the same

Note that punch.py only finds blocks of 4096 bytes to punch out, so it might not make a file exactly as sparse as it was when you started. It could be made smarter, of course. Also, it's only lightly tested, so be careful and make backups before trusting it!
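The script itself is not reproduced above. As a rough illustration of the same idea (not the actual punch.py), here is a minimal sketch assuming 64-bit Linux with glibc, where off_t is 64 bits wide, and a file system that supports hole punching. The function names punch_hole and punch_zero_blocks are hypothetical; the two flag values are taken from <linux/falloc.h>.

```python
import ctypes
import os

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

# Assumption: fallocate() is reachable through the process's own C library
# and off_t is 64 bits wide (true on 64-bit Linux).
_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def punch_hole(fd, offset, length):
    """Deallocate length bytes at offset without changing the file size."""
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if _libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def punch_zero_blocks(path, blocksize=4096):
    """Punch a hole over every full, all-zero block; return bytes punched."""
    zeros = bytes(blocksize)
    punched = 0
    fd = os.open(path, os.O_RDWR)
    try:
        offset = 0
        while True:
            block = os.pread(fd, blocksize, offset)
            if not block:  # end of file
                break
            if block == zeros:  # a short final block never matches
                punch_hole(fd, offset, blocksize)
                punched += blocksize
            offset += len(block)
    finally:
        os.close(fd)
    return punched
```

Like punch.py, this only considers aligned blocks of a fixed size and reads the whole file, so it is meant to show the FALLOC_FL_PUNCH_HOLE call rather than to be efficient. The same caveat applies: try it on a copy first.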
