Find – How to Find Total Filesize Grouped by Extension

disk-usage find ls

I work on a cluster shared with colleagues. Disk space is limited (and has run out on some occasions), so I clean up my part now and then. I want to do this quickly, so until now I have listed files larger than 100 MB and older than 3 months, then checked whether I still need them.

But now I suspect there could be a folder with more than 1000 smaller files that I am missing, so I want an easy way to check. Given the way I generate data, a list of total size per extension would help. In the context of this question, 'extension' means everything after the last dot in the filename.

Suppose I have multiple folders with multiple files:

folder1/file1.bmp   40 kiB
folder1/file2.jpg   20 kiB
folder2/file3.bmp   30 kiB
folder2/file4.jpg    8 kiB

Is it possible to make a list of total filesize per file extension, so like this:

bmp 70 kiB
jpg 28 kiB

I don't care about files without extension, so they can be ignored or put in one category.

I already went through man pages of ls, du and find, but I don't know what is the right tool for this job…

Best Answer

On a GNU system:

find . -name '?*.*' -type f -printf '%b.%f\0' |
  awk -F . -v RS='\0' '
    {s[$NF] += $1; n[$NF]++}
    END {for (e in s) printf "%15d %4d %s\n", s[e]*512, n[e], e}' |
  sort -n

Or the same with perl, avoiding the -printf extension of GNU find (still using a GNU extension, -print0, but this one is more widely supported nowadays):

find . -name '?*.*' -type f -print0 |
  perl -0ne '
    if (@s = stat$_){
      ($ext = $_) =~ s/.*\.//s;
      $s{$ext} += $s[12];
      $n{$ext}++;
    }
    END {
      for (sort{$s{$a} <=> $s{$b}} keys %s) {
        printf "%15d %4d %s\n",  $s{$_}<<9, $n{$_}, $_;
      }
    }'

It gives an output like:

          12288    1 pnm
          16384    4 gif
         204800    2 ico
        1040384   17 jpg
        2752512   83 png

If you want KiB, MiB... suffixes, pipe to numfmt --to=iec-i --suffix=B.
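As a quick illustration of that suggestion, the sketch below feeds one line shaped like the sample output through numfmt. It assumes GNU coreutils numfmt, which by default converts only the first whitespace-separated field and passes the rest of the line through; the variable names are made up for the example.

```shell
# One record shaped like the output above: bytes, file count, extension.
line='12288 1 pnm'

# numfmt rewrites field 1 (the byte total) with an IEC suffix
# and leaves the remaining fields untouched.
human=$(printf '%s\n' "$line" | numfmt --to=iec-i --suffix=B)
printf '%s\n' "$human"
```

In the full pipeline the same numfmt call would simply go after sort -n.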

%b*512 gives the disk usage, but note that if files are hard linked several times, they will be counted several times so you may see a discrepancy with what du reports.
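Disk usage and content length can differ (block rounding, sparse files). If the byte length of the files is what matters, one possible variant, sketched here and not part of the original answer, swaps %b for %s, which GNU find documents as the file size in bytes. The function name ext_sizes and the scratch directory are hypothetical; the sample files mirror the ones in the question. Like the command above, it assumes an awk that accepts RS='\0' (gawk does).

```shell
# Hypothetical helper: total apparent size (bytes) per extension under $1.
ext_sizes() {
  find "${1:-.}" -name '?*.*' -type f -printf '%s.%f\0' |
    awk -F . -v RS='\0' '
      {s[$NF] += $1; n[$NF]++}   # $1 = size in bytes, $NF = extension
      END {for (e in s) printf "%15d %4d %s\n", s[e], n[e], e}' |
    sort -n
}

# Recreate the sample tree from the question in a scratch directory.
demo=$(mktemp -d)
mkdir "$demo/folder1" "$demo/folder2"
head -c 40960 /dev/zero > "$demo/folder1/file1.bmp"   # 40 KiB
head -c 20480 /dev/zero > "$demo/folder1/file2.jpg"   # 20 KiB
head -c 30720 /dev/zero > "$demo/folder2/file3.bmp"   # 30 KiB
head -c 8192  /dev/zero > "$demo/folder2/file4.jpg"   #  8 KiB

result=$(ext_sizes "$demo")
printf '%s\n' "$result"
```

The columns are total bytes, file count, and extension, as in the disk-usage version, but no multiplication by 512 is needed because %s is already in bytes.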

Example

This will list the size of all the files along with a summary total.

$ find -maxdepth 2 -type f | tr '\n' '\0' | du -ch --files0-from=- | tail -10
0   ./92086/2.txt
0   ./92086/5.txt
0   ./92086/14.txt
0   ./92086/19.txt
0   ./92086/18.txt
0   ./92086/17.txt
4.0K    ./load.bash
4.0K    ./100855/plain.txt
4.0K    ./100855/tst_ccmds.bash
21M total

NOTE: This solution requires that du support the --files0-from= switch which is a GNU switch, to my knowledge.

excerpt from du man page

--files0-from=F
          summarize disk usage of the NUL-terminated file names specified in 
          file F; If F is - then read names from standard input

Also, this method breaks on file names that contain newlines: tr converts every newline to a NUL, so such a name is split into pieces that du then fails to find.

Examples

du: cannot access `./101415/fileD': No such file or directory
du: cannot access `E': No such file or directory

These could be dealt with by introducing more tr .. .. commands to substitute them with alternative characters. However there is a better way, if you have access to GNU's find.

Improvements

If your version of find offers the -print0 switch, you can use this incantation instead, which copes with file names containing spaces, newlines, or other non-printable characters.

$ find -maxdepth 2 -type f -print0 | du -ch --files0-from=- | tail -10
0   ./92086/2.txt
0   ./92086/5.txt
0   ./92086/14.txt
0   ./92086/19.txt
0   ./92086/18.txt
0   ./92086/17.txt
4.0K    ./load.bash
4.0K    ./100855/plain.txt
4.0K    ./100855/tst_ccmds.bash
21M total
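The difference between the two approaches can be checked with a short sketch. It assumes GNU find and tr and a throwaway directory from mktemp; the file name is made up to mirror the fileD / E errors shown earlier. Counting newline-delimited records sees two names, while counting the NULs emitted by -print0 sees the single real file.

```shell
# Scratch directory holding one file whose name contains a newline.
scratch=$(mktemp -d)
touch "$scratch/fileD
E"

# Newline-delimited output: the one file shows up as two lines.
by_lines=$(find "$scratch" -type f | wc -l)

# NUL-delimited output: exactly one terminator per file.
by_nuls=$(find "$scratch" -type f -print0 | tr -cd '\0' | wc -c)

printf 'lines=%s nuls=%s\n' "$by_lines" "$by_nuls"
```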
