Bash – How to Measure Disk Usage of Specific File Types Recursively

Tags: bash, disk-usage, find, gnu, performance

This is my working code, but I believe it's not optimized – there must be a way to complete the job much faster than this:

find . -type f -iname '*.py' -printf '%h\0' |
  sort -z -u |
  xargs -r -0 -I{} sh -c '
    find "{}" -maxdepth 1 -type f -iname "*.py" -print0 |
      xargs -r -0 du -sch |
      tail -1 |
      cut -f1 |
      tr "\n" " "
    echo -e "{}"' |
  sort -k1 -hr |
  head -50

The goal is to recursively find all directories that contain *.py files, print the total size of the *.py files in each directory next to that directory's name, sort the results in descending order by size, and show only the first 50.

Any ideas on how to improve this code performance-wise while keeping the same output?

EDIT:

I tested your proposals on the following sample: 5805 files, 47 GB in total.
Unfortunately, I couldn't compare them head-to-head, since not all proposals follow the same guidelines: the total size should be the disk usage, the delimiter should be a single space, and sizes should be formatted with numfmt --to=iec-i --suffix=B.
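
For reference, numfmt (from GNU coreutils) turns a raw byte count into that format:

$ printf '50465865728\n' | numfmt --to=iec-i --suffix=B
47GiB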

The following four produce sorted output, but David's displays the cumulative apparent size of the files, not the real disk usage. Still, his improvement is significant: more than 9.5x faster. Stéphane's and Isaac's solutions are the winners by a very tight margin, as their code runs approximately 32x faster than the reference code.

$ time madjoe.sh
real    0m2,752s
user    0m3,022s
sys     0m0,785s

$ time david.sh 
real    0m0,289s
user    0m0,206s
sys     0m0,131s

$ time isaac.sh 
real    0m0,087s
user    0m0,032s
sys     0m0,032s

$ time stephane.sh 
real    0m0,086s
user    0m0,013s
sys     0m0,047s

The following code unfortunately neither sorts the results nor limits them to the largest 50 (and, in an earlier comparison against Isaac's code, it was approximately 6x slower than Isaac's improvement):

$ time hauke.sh 
real    0m0,567s
user    0m0,609s
sys     0m0,122s

Best Answer

To count the disk usage as opposed to the sum of the apparent size, you'd need to use %b¹ instead of %s and make sure each file is counted only once, so something like:

LC_ALL=C find . -iname '*.py' -type f -printf '%D:%i\0%b\0%h\0' |
  gawk -v 'RS=\0' -v OFS='\t' -v max=50 '
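    # each file yields three NUL-terminated records: dev:inode, block count, dir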
    {
      inum = $0
      getline du
      getline dir
    }
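    # count each file (same device:inode) only once, even via multiple hard links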
    ! seen[inum]++ {
      gsub(/\\/, "&&", dir)
      gsub(/\n/, "\\n", dir)
      sum[dir] += du
    }
    END {
      n = 0
      PROCINFO["sorted_in"] = "@val_num_desc"
      for (dir in sum) {
        print sum[dir] * 512, dir
        if (++n >= max) break
      }
    }' | numfmt --to=iec-i --suffix=B --delimiter=$'\t'

Newlines in the dir names are rendered as \n, and backslashes (at least those decoded as such in the current locale²) as \\.

If a file is found in more than one directory, it is counted against the first one it is found in (order is not deterministic).
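
That situation typically arises with hard links. For example:

$ mkdir -p d1 d2
$ touch d1/a.py
$ ln d1/a.py d2/b.py
$ find . -iname '*.py' -printf '%D:%i %p\n'

Both output lines carry the same device:inode pair, so the seen[] check above adds the file's blocks to only one directory's total.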

It assumes there's no POSIXLY_CORRECT variable in the environment (if there is, setting PROCINFO["sorted_in"] has no effect in gawk, so the list would not be sorted). If you can't guarantee it³, you can always start gawk as env -u POSIXLY_CORRECT gawk ... (assuming GNU env or compatible), or as (unset -v POSIXLY_CORRECT; gawk ...).

A few other problems with your approach:

  • Without LC_ALL=C, GNU find wouldn't report files whose names contain byte sequences that don't form valid characters in the locale, so you could miss some files.
  • Embedding {} in the sh code constitutes an arbitrary command injection vulnerability. Think for instance of a file called $(reboot).py. You should never do that: the paths to the files should be passed as extra arguments and referenced within the code using positional parameters (see the sketch after this list).
  • echo can't be used to display arbitrary data (especially with -e, which doesn't make sense here). Use printf instead.
  • With xargs -r0 du -sch, du may be invoked several times if the list of files is big, and in that case the last line will only include the total for the last run.
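
Putting those fixes together, a safer variant of the original pipeline could look like this. It's a minimal sketch, assuming GNU du (for --files0-from); directory names containing newlines would still confuse the final line-oriented sort, which the gawk solution above handles properly:

find . -type f -iname '*.py' -printf '%h\0' |
  sort -z -u |
  xargs -r -0 sh -c '
    for dir do
      # one du invocation per directory; --files0-from=- reads the whole
      # NUL-delimited file list from stdin, so the total is never split
      find "$dir" -maxdepth 1 -type f -iname "*.py" -print0 |
        du -ch --files0-from=- |
        tail -n 1 |
        cut -f1 |
        tr "\n" " "
      printf "%s\n" "$dir"   # printf, not echo -e, copes with arbitrary data
    done' sh |
  sort -k1 -hr |
  head -50

The directory names arrive as positional parameters of sh (the trailing sh fills in $0), so no file name is ever parsed as shell code.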

¹ %b reports the disk usage as a number of 512-byte units. 512 bytes is the minimum granularity for disk allocation, as that's the size of a traditional sector. There's also %k, which is int(%b / 2), but that would give incorrect results on filesystems that have 512-byte blocks (filesystem blocks are generally a power of 2 and at least 512 bytes large).
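
On a typical filesystem with 4096-byte blocks, for instance, even a 1-byte file allocates a full block:

$ printf x > tiny.py
$ find tiny.py -printf '%s %b\n'
1 8

That is an apparent size of 1 byte, but 8 × 512 = 4096 bytes of actual disk usage.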

² Using LC_ALL=C for gawk as well would make it a bit more efficient, but could mangle the output in locales using the BIG5 or GB18030 charsets (when the file names are encoded in one of those charsets too), as the encoding of backslash is also found in the encoding of some other characters there.

³ Beware that if your sh is bash, POSIXLY_CORRECT is set to y in sh scripts, and it is exported to the environment if sh is started with -a or -o allexport, so that variable can also creep in unintentionally.
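
One way to see this, assuming a bash recent enough to set the variable in posix mode (which is how it behaves when run as sh):

$ bash --posix -c 'echo "$POSIXLY_CORRECT"'
y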
