I would like to find the largest file in each directory, recursively.


The output would include the directory name, file name and file size: one entry (the largest file) for each directory below the one where the command is run.

If possible, it would also show the average size of the files in that directory.

The purpose is to scan the directories looking for files that are much larger than the others in the same directory, so they can be replaced.

Best Answer

Combining find and awk allows the averages to be calculated too:

find . -type f -printf '%s %h/%f\0'|awk 'BEGIN { RS="\0" } { SIZE=$1; for (i = 1; i <= NF - 1; i++) $i = $(i + 1); NF = NF - 1; DIR=$0; gsub("/[^/]+$", "", DIR); FILE=substr($0, length(DIR) + 2); SUMSIZES[DIR] += SIZE; NBFILES[DIR]++; if (SIZE > MAXSIZE[DIR] || !BIGGESTFILE[DIR]) { MAXSIZE[DIR] = SIZE; BIGGESTFILE[DIR] = FILE } }; END { for (DIR in SUMSIZES) { printf "%s: average %f, biggest file %s %d\n", DIR, SUMSIZES[DIR] / NBFILES[DIR], BIGGESTFILE[DIR], MAXSIZE[DIR] } }'

Laid out in a more readable manner, the AWK script is:

BEGIN { RS="\0" }

{
  SIZE=$1
  for (i = 1; i <= NF - 1; i++) $i = $(i + 1)
  NF = NF - 1
  DIR=$0
  gsub("/[^/]+$", "", DIR)
  FILE=substr($0, length(DIR) + 2)
  SUMSIZES[DIR] += SIZE
  NBFILES[DIR]++
  if (SIZE > MAXSIZE[DIR] || !BIGGESTFILE[DIR]) {
    MAXSIZE[DIR] = SIZE
    BIGGESTFILE[DIR] = FILE
  }
}

END {
  for (DIR in SUMSIZES) {
    printf "%s: average %f, biggest file %s %d\n", DIR, SUMSIZES[DIR] / NBFILES[DIR], BIGGESTFILE[DIR], MAXSIZE[DIR]
  }
}

This expects null-separated input records (I stole this from muru’s answer); for each input record, it

  • stores the size (for later use),
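  • shifts the remaining fields left to drop the size, leaving only the path (so we at least handle filenames with spaces correctly, although runs of whitespace are collapsed to a single space when awk rebuilds the record),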
  • extracts the directory,
  • extracts the filename,
  • adds the size we stored earlier to the sum of sizes in the directory,
  • increments the number of files in the directory (so we can calculate the average),
  • if the size is larger than the stored maximum size for the directory, or if we haven’t seen a file in the directory yet, updates the information for the biggest file.

Once all that’s done, the script loops over the keys in SUMSIZES and outputs the directory, average size, largest file’s name and size.
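As a variation on the steps above, here is a minimal sketch that splits each record at its first space with index() instead of shifting fields, so whitespace inside file names is kept as-is. It is not the script from this answer: the function name largest_per_dir and the lower-case variable names are made up for illustration, and it assumes GNU find (for -printf) and an awk that accepts "\0" as RS, such as gawk.

```shell
#!/bin/sh
# Sketch only: per-directory average size and largest file.
# Assumptions: GNU find (-printf) and an awk that accepts "\0" as RS (gawk does).
# largest_per_dir is a hypothetical helper name, not part of any tool.
largest_per_dir() {
  find "${1:-.}" -type f -printf '%s %p\0' |
  awk 'BEGIN { RS = "\0" }
  {
    sep  = index($0, " ")               # the first space ends the size
    size = substr($0, 1, sep - 1) + 0   # force a numeric value
    path = substr($0, sep + 1)          # rest of the record, whitespace intact
    dir  = path
    sub("/[^/]+$", "", dir)             # strip the final /name component
    file = substr(path, length(dir) + 2)
    sum[dir] += size
    count[dir]++
    if (count[dir] == 1 || size > max[dir]) {
      max[dir] = size
      biggest[dir] = file
    }
  }
  END {
    for (dir in sum) {
      printf "%s: average %f, biggest file %s %d\n", dir, sum[dir] / count[dir], biggest[dir], max[dir]
    }
  }'
}
```

Call it as largest_per_dir /some/path, or with no argument to scan the current directory. Counting files with count[dir] rather than testing the stored name is only a stylistic choice here; the output format is the same as the script above.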

You can pipe the output into sort to sort by directory name. If you want to additionally format the sizes in human-friendly form, you can change the printf line to

printf "%.2f %d %s: %s\n", SUMSIZES[DIR] / NBFILES[DIR], MAXSIZE[DIR], DIR, BIGGESTFILE[DIR]

and then pipe the output into numfmt --field=1,2 --to=iec. You can still sort the result by directory name; you just need to sort starting with the third field: sort -k3.
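Put together, the whole pipeline might look like the sketch below. The awk body is the script from this answer with only the printf line swapped; the wrapper name dir_report is hypothetical, and the sketch assumes GNU find, GNU numfmt from a coreutils version that accepts a field list, and an awk that accepts "\0" as RS, such as gawk.

```shell
#!/bin/sh
# Sketch only: dir_report is a hypothetical wrapper name. The awk body is the
# script above with the printf line replaced.
# Assumptions: GNU find, GNU numfmt, and an awk that accepts "\0" as RS.
dir_report() {
  find "${1:-.}" -type f -printf '%s %h/%f\0' |
  awk 'BEGIN { RS = "\0" }
  {
    SIZE = $1
    for (i = 1; i <= NF - 1; i++) $i = $(i + 1)   # shift fields left, dropping the size
    NF = NF - 1
    DIR = $0
    gsub("/[^/]+$", "", DIR)
    FILE = substr($0, length(DIR) + 2)
    SUMSIZES[DIR] += SIZE
    NBFILES[DIR]++
    if (SIZE > MAXSIZE[DIR] || !BIGGESTFILE[DIR]) {
      MAXSIZE[DIR] = SIZE
      BIGGESTFILE[DIR] = FILE
    }
  }
  END {
    for (DIR in SUMSIZES) {
      # numeric columns come first so numfmt can address them as fields 1 and 2
      printf "%.2f %d %s: %s\n", SUMSIZES[DIR] / NBFILES[DIR], MAXSIZE[DIR], DIR, BIGGESTFILE[DIR]
    }
  }' |
  numfmt --field=1,2 --to=iec |   # human-readable average and maximum
  sort -k3                        # order by directory name (third field onward)
}
```

Running dir_report with no argument scans the current directory; each output line then holds the average size, the largest size, the directory followed by a colon, and the name of the largest file.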
