How to Find Total Filesize Grouped by Extension

Tags: disk-usage, find, ls

I work on a cluster shared with colleagues. Disk space is limited (and has run out on some occasions), so I clean up my part from time to time. To do this quickly, I have so far listed the files that are larger than 100 MB and older than 3 months, and checked whether I still need them.

But now I suspect there could be a folder with more than 1000 smaller files that I am missing, so I want an easy way to check for this. Given the way I generate data, a list of total size per extension would help. In the context of this question, 'extension' means everything after the last dot in the filename.

Suppose I have multiple folders with multiple files:

folder1/file1.bmp   40 kiB
folder1/file2.jpg   20 kiB
folder2/file3.bmp   30 kiB
folder2/file4.jpg    8 kiB

Is it possible to make a list of total filesize per file extension, so like this:

bmp 70 kiB
jpg 28 kiB

I don't care about files without extension, so they can be ignored or put in one category.

I already went through the man pages of ls, du and find, but I don't know which is the right tool for this job…

Best Answer

On a GNU system:

find . -name '?*.*' -type f -printf '%b.%f\0' |
  awk -F . -v RS='\0' '
    {s[$NF] += $1; n[$NF]++}
    END {for (e in s) printf "%15d %4d %s\n", s[e]*512, n[e], e}' |
  sort -n

Or the same with perl, avoiding the -printf extension of GNU find (still using a GNU extension, -print0, but this one is more widely supported nowadays):

find . -name '?*.*' -type f -print0 |
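  perl -0ne '
    if (@s = stat$_){               # skip files that cannot be stat()ed
      ($ext = $_) =~ s/.*\.//s;     # keep only what follows the last dot
      $s{$ext} += $s[12];           # stat field 12: allocated blocks
      $n{$ext}++;                   # number of files with this extension
    }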
    END {
      for (sort{$s{$a} <=> $s{$b}} keys %s) {
        printf "%15d %4d %s\n",  $s{$_}<<9, $n{$_}, $_;
      }
    }'

It gives an output like:

          12288    1 pnm
          16384    4 gif
         204800    2 ico
        1040384   17 jpg
        2752512   83 png

If you want KiB, MiB... suffixes, pipe to numfmt --to=iec-i --suffix=B.
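
As a minimal sketch of that last step, assuming GNU coreutils' numfmt is installed: the helper name humanize_sizes is made up for illustration, and --field=1 only spells out numfmt's default of converting the first whitespace-delimited field.

```shell
# Hypothetical helper: convert the leading byte count of each report line
# to IEC units, leaving the file count and extension columns untouched.
humanize_sizes() {
  numfmt --to=iec-i --suffix=B --field=1
}

# Usage: append it after the final sort of either pipeline above, e.g.
#   find . ... | awk ... | sort -n | humanize_sizes
# Here it is fed a single sample line in the same format:
printf '%15d %4d %s\n' 12288 1 pnm | humanize_sizes
```

The sample line should come out with 12KiB in the first column. Convert after sorting rather than before, since sort -n cannot order values that carry unit suffixes.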

%b is the number of 512-byte blocks allocated to the file, so %b*512 gives the disk usage in bytes. Note that a file with several hard links is counted once per link, so you may see a discrepancy with what du reports.
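
If the hard-link discrepancy matters, one possible variation (a sketch, not part of the answer above) is to have GNU find also print each file's device and inode numbers with %D and %i, and to let awk skip any inode it has already seen. The function name ext_usage_unique is hypothetical. A file whose links carry different extensions is credited to whichever link find reports first.

```shell
# Hypothetical variant of the awk pipeline: count each inode only once.
# Assumes GNU find (-printf with %D, %i, %b, %f) and an awk that accepts
# RS='\0', as the original pipeline does.
ext_usage_unique() {
  find "${1:-.}" -name '?*.*' -type f -printf '%D:%i %b.%f\0' |
    awk -F . -v RS='\0' '
      {
        split($1, a, " ")        # a[1] = device:inode, a[2] = 512-byte blocks
        if (seen[a[1]]++) next   # another link to an inode already counted
        s[$NF] += a[2]; n[$NF]++
      }
      END {for (e in s) printf "%15d %4d %s\n", s[e]*512, n[e], e}' |
    sort -n
}
```

Called as ext_usage_unique /some/dir (or with no argument for the current directory), it prints the same three columns as before: bytes, number of distinct files, extension.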
