How to display the non-sparse parts of a sparse file

sparse-files text-processing

Imagine a file created with:

truncate -s1T file
echo test >> file
truncate -s2T file

I now have a 2 tebibyte file (occupying only 4 KiB on disk) with "test\n" written in the middle.
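
To check those two numbers, the apparent size can be compared with the space actually allocated. A minimal sketch, assuming GNU stat (the sparse_report function name is made up for illustration and is not part of any tool):

```shell
# Hypothetical helper: print a file's apparent size next to its allocated space.
# Assumes GNU stat: %s = size in bytes, %b = allocated blocks, %B = bytes per block.
sparse_report() {
  stat -c '%n: %s bytes apparent, %b blocks of %B bytes allocated' -- "$1"
}
```

For the file above, sparse_report file should show an apparent size of 2199023255552 bytes and only a few allocated blocks.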

How would I recover that "test" efficiently, that is, without having to read the whole file?

tr -d '\0' < file

That would give me the result, but it would take hours.

What I'd like is something that outputs only the non-sparse parts of the file (so in the example above, only "test\n" or, more likely, the 4 KiB block allocated on disk that stores that data).

There are APIs to find out which parts of the file are allocated (FIBMAP, FIEMAP, SEEK_HOLE, SEEK_DATA…), but what tools expose those?

A portable solution (at least to the OSes that support those APIs) would be appreciated.

Best Answer

The best I could come up with so far is the following ksh93 script. It uses filefrag from e2fsprogs 1.42.9 (some older versions have a different interface) and works on extent-based file systems on Linux:

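#! /bin/ksh93
export LC_ALL=C
for file do
  # -v lists every extent; -b1 reports offsets and lengths in units of 1 byte
  filefrag -vb1 -- "$file" |
    # split each line on ":", "." and spaces into the array a
    while IFS=": ." read -A a; do
      # keep extent lines (first field numeric) that are not flagged unwritten;
      # a[1] is the logical start offset and a[7] the extent length
      [[ $a = +([0-9]) ]] && [[ ${a[@]} != *unwritten* ]] &&
        command /opt/ast/bin/head -s "${a[1]}" -c "${a[7]}" -- "$file"
    done
done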

filefrag reports the extents of the file using the FIEMAP ioctl for the filesystems that support it.

The *unwritten* test skips extents that have been allocated with fallocate but never written to. Those are not sparse, but they still read as zeros, which I'm not interested in.
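
Where ksh93's head builtin is not available, the same approach can be sketched in plain sh. This is only a sketch under assumptions that are not in the original: it relies on GNU dd (for iflag=skip_bytes,count_bytes and status=none) and on filefrag printing the same column layout as e2fsprogs 1.42.9, and the function names written_extents and dump_written are made up for illustration.

```shell
#! /bin/sh
# Sketch only: assumes GNU dd and the `filefrag -v` column layout of
# e2fsprogs 1.42.9:
#   ext: logical_offset: physical_offset: length: expected: flags:

# Hypothetical helper: read `filefrag -v -b1` output on stdin and print
# "offset length" (both in bytes) for each extent not flagged as unwritten.
written_extents() {
  LC_ALL=C awk -F: '
    # extent lines look like "N: START..END: START..END: LENGTH: ..."
    $1 ~ /^ *[0-9]+$/ && $2 ~ /^ *[0-9]+\.\. *[0-9]+$/ && !/unwritten/ {
      start = $2; sub(/\.\..*/, "", start); gsub(/ /, "", start)
      len = $4; gsub(/ /, "", len)
      # printed as strings, so large offsets are not reformatted by awk
      print start, len
    }'
}

# Hypothetical helper: copy the written extents of one file to stdout.
dump_written() {
  filefrag -v -b1 -- "$1" | written_extents |
    while read -r offset length; do
      # skip_bytes/count_bytes make skip= and count= byte counts (GNU dd)
      dd if="$1" bs=64k iflag=skip_bytes,count_bytes \
         skip="$offset" count="$length" status=none
    done
}
```

Calling dump_written file on the example from the question should print the 4 KiB block that contains "test\n". The bs=64k value only sets the read size and is an arbitrary choice.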

Recent versions of bsdtar or star can use some of those APIs to generate a tar file that identifies the sparse sections as such. That would make for a more portable solution, but then one would have to parse the generated tar file to get the non-sparse sections.
