You can use a combination of a few utilities:

```sh
stat -c '%s %n' pre_* | sort -k1,1rn | tail -n +2 | cut -d' ' -f2 | xargs rm
```

Assuming a GNU system and no unusual filenames:

- `stat` gets the file size and name, separated by a space, for all `pre_*` files
- `sort` sorts the files by size, with the largest one first
- `tail -n +2` keeps everything except that largest file
- `cut -d' ' -f2` extracts the file name, and `rm` (via `xargs rm`) does the removal
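If the names may contain spaces, a NUL-delimited sketch along the same lines can cope (assuming reasonably recent GNU coreutils and findutils, whose `sort`, `tail`, `cut` and `xargs` all understand NUL-delimited input):

```sh
stat --printf '%s %n\0' pre_* |
  sort -z -k1,1rn |    # largest file first
  tail -z -n +2 |      # drop it, keep the rest
  cut -z -d' ' -f2- |  # strip the leading size column (-f2- keeps spaces in names)
  xargs -r0 rm         # remove what remains
```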
To count the disk usage as opposed to the sum of the apparent sizes, you'd need to use `%b`¹ instead of `%s` and make sure each file is counted only once, so something like:
```sh
LC_ALL=C find . -iname '*.py' -type f -printf '%D:%i\0%b\0%h\0' |
  gawk -v 'RS=\0' -v OFS='\t' -v max=50 '
    {
      inum = $0      # first field of each triplet: device:inode (%D:%i)
      getline du     # second: disk usage in 512-byte units (%b)
      getline dir    # third: the containing directory (%h)
    }
    ! seen[inum]++ { # count each file only once, even with hard links
      gsub(/\\/, "&&", dir)   # double the backslashes in the dir name
      gsub(/\n/, "\\n", dir)  # render embedded newlines as \n
      sum[dir] += du
    }
    END {
      n = 0
      PROCINFO["sorted_in"] = "@val_num_desc"  # iterate by decreasing usage
      for (dir in sum) {
        print sum[dir] * 512, dir
        if (++n >= max) break
      }
    }' | numfmt --to=iec-i --suffix=B --delimiter=$'\t'
```
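To see what the final `numfmt` stage does, a tiny standalone example (the figures and paths are made up for illustration):

```sh
# numfmt rewrites the first tab-delimited column into IEC units:
printf '1536000\t./src\n512000\t./tests\n' |
  numfmt --to=iec-i --suffix=B --delimiter=$'\t'
# 1.5MiB	./src
# 500KiB	./tests
```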
Newlines in the dir names are rendered as `\n`, and backslashes (at least those decoded as such in the current locale²) as `\\`.
If a file is found in more than one directory, it is counted against the first one it is found in (order is not deterministic).
It assumes there's no `POSIXLY_CORRECT` variable in the environment (if there is, setting `PROCINFO["sorted_in"]` has no effect in `gawk`, so the list would not be sorted). If you can't guarantee it³, you can always start `gawk` as `env -u POSIXLY_CORRECT gawk ...` (assuming GNU `env` or compatible; or `(unset -v POSIXLY_CORRECT; gawk ...)`).
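A quick, self-contained way to see the effect (assuming GNU `gawk` and `env`):

```sh
# With POSIXLY_CORRECT in the environment, gawk ignores PROCINFO["sorted_in"]
# and the loop below comes out in arbitrary order; unsetting it restores the
# requested by-value descending order:
env -u POSIXLY_CORRECT gawk 'BEGIN {
  PROCINFO["sorted_in"] = "@val_num_desc"
  size["./a"] = 3; size["./b"] = 1; size["./c"] = 2
  for (d in size) print size[d], d    # prints 3 ./a, 2 ./c, 1 ./b
}'
```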
A few other problems with your approach:
- without `LC_ALL=C`, GNU `find` wouldn't report the files whose names don't form valid characters in the locale, so you could miss some files.
- Embedding `{}` in the code of `sh` constitutes an arbitrary code injection vulnerability. Think for instance of a file called `$(reboot).py`. You should never do that; the paths to the files should be passed as extra arguments and referenced within the code using positional parameters (a safer pattern is sketched after this list).
- `echo` can't be used to display arbitrary data (especially with `-e`, which doesn't make sense here). Use `printf` instead.
- With `xargs -r0 du -sch`, `du` may be invoked several times if the list of files is big, and in that case the last line will only include the total for the last run (see the `--files0-from` sketch below).
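Illustrating those last three points, a hedged sketch of a safer variant (assuming GNU `find` and `du`; the loop is only there to show the positional-parameter pattern):

```sh
# File paths are passed to sh as extra arguments, never embedded in the code,
# and printf (not echo) is what displays them:
find . -name '*.py' -type f -exec sh -c '
  for file do
    printf "%s\n" "$file"
  done' sh {} +

# A single du invocation, so the grand total from -c covers every file
# even when the list is long (GNU du):
find . -name '*.py' -type f -print0 | du -sch --files0-from=-
```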
¹ `%b` reports disk usage in number of 512-byte units. 512 bytes is the minimum granularity for disk allocation, as that's the size of a traditional sector. There's also `%k`, which is `int(%b / 2)`, but that would give incorrect results on filesystems that have 512-byte blocks (file system blocks are generally a power of 2 and at least 512 bytes large).
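The difference between apparent size and disk usage is easy to observe with a sparse file (a sketch assuming GNU `truncate` and `stat`; the `sparse.bin` name is arbitrary):

```sh
truncate -s 1M sparse.bin   # 1MiB of apparent size, no data written
stat -c '%s bytes apparent, %b 512-byte blocks on disk' sparse.bin
# typically prints: 1048576 bytes apparent, 0 512-byte blocks on disk
rm -f sparse.bin
```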
² Using `LC_ALL=C` for `gawk` as well would make it a bit more efficient, but would possibly mangle the output in locales using BIG5 or GB18030 charsets (and the file names are also encoded in that charset), as the encoding of backslash is also found in the encoding of some other characters there.
³ Beware that if your `sh` is `bash`, `POSIXLY_CORRECT` is set to `y` in `sh` scripts, and it is exported to the environment if `sh` is started with `-a` or `-o allexport`, so that variable can also creep in unintentionally.
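A quick check of whether your `sh` behaves that way (a sketch; the output depends on which shell provides `sh`):

```sh
# When sh is bash, posix mode sets POSIXLY_CORRECT inside the script:
sh -c 'printf "%s\n" "${POSIXLY_CORRECT-unset}"'
# prints "y" if sh is bash, "unset" with most other shells

# With allexport it is also exported, so it leaks into child processes:
sh -o allexport -c 'env | grep "^POSIXLY_CORRECT="'
```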
Best Answer

Combining `find` and `awk` allows the averages to be calculated too. This expects null-separated input records (I stole this from muru's answer); for each input record, it adds the file's size to its directory's running total and tracks the largest file seen so far in that directory.

Once all that's done, the script loops over the keys in `SUMSIZES` and outputs the directory, average size, largest file's name and size. You can pipe the output into `sort` to sort by directory name. If you want to additionally format the sizes in human-friendly form, you can change the `printf` line to put the two sizes first and then pipe the output into `numfmt --field=1,2 --to=iec`. You can still sort the result by directory name; you just need to sort starting with the third field: `sort -k3`.
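A minimal sketch of what such a combination might look like (assuming GNU `find` and `gawk`; only the `SUMSIZES` name comes from the description above, the other names are hypothetical):

```sh
find . -type f -printf '%s\0%h\0%f\0' | gawk -v 'RS=\0' '
  {
    size = $0       # apparent size in bytes (%s)
    getline dir     # containing directory (%h)
    getline name    # file name (%f)
    SUMSIZES[dir] += size
    count[dir]++
    if (size >= maxsize[dir]) {   # remember the largest file per directory
      maxsize[dir] = size
      maxname[dir] = name
    }
  }
  END {
    for (dir in SUMSIZES)
      printf "%s\t%d\t%s\t%d\n",
             dir, SUMSIZES[dir] / count[dir], maxname[dir], maxsize[dir]
  }' | sort
```

To feed `numfmt --field=1,2 --to=iec` as described, move the two size columns to the front of the `printf` and switch the final `sort` to `sort -k3`.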