You can use a combination of a few utilities:

```sh
stat -c '%s %n' pre_* | sort -k1,1rn | tail -n +2 | cut -d' ' -f2 | xargs rm
```

Assuming a GNU system and no unusual filenames:

- `stat` gets the file size and name, separated by a space, for all `pre_*` files
- `sort` sorts the files by size, with the largest one first
- `tail -n +2` keeps everything except that largest file
- `cut -d' ' -f2` extracts the file name, and `rm` (via `xargs rm`) does the removal
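If the names may contain spaces, a NUL-delimited sketch along the same lines can cope (assuming reasonably recent GNU coreutils and findutils, whose `sort`, `tail`, `cut` and `xargs` all understand NUL-delimited input):

```sh
stat --printf '%s %n\0' pre_* |
  sort -z -k1,1rn |    # largest file first
  tail -z -n +2 |      # drop it, keep the rest
  cut -z -d' ' -f2- |  # strip the leading size column (-f2- keeps spaces in names)
  xargs -r0 rm         # remove what remains
```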
To count the disk usage as opposed to the sum of the apparent sizes, you'd need to use `%b`¹ instead of `%s` and make sure each file is counted only once, so something like:
```sh
LC_ALL=C find . -iname '*.py' -type f -printf '%D:%i\0%b\0%h\0' |
  gawk -v 'RS=\0' -v OFS='\t' -v max=50 '
    {
      inum = $0      # first field of each triplet: device:inode (%D:%i)
      getline du     # second: disk usage in 512-byte units (%b)
      getline dir    # third: the containing directory (%h)
    }
    ! seen[inum]++ { # count each file only once, even with hard links
      gsub(/\\/, "&&", dir)   # double the backslashes in the dir name
      gsub(/\n/, "\\n", dir)  # render embedded newlines as \n
      sum[dir] += du
    }
    END {
      n = 0
      PROCINFO["sorted_in"] = "@val_num_desc"  # iterate by decreasing usage
      for (dir in sum) {
        print sum[dir] * 512, dir
        if (++n >= max) break
      }
    }' | numfmt --to=iec-i --suffix=B --delimiter=$'\t'
```
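To see what the final `numfmt` stage does, a tiny standalone example (the figures and paths are made up for illustration):

```sh
# numfmt rewrites the first tab-delimited column into IEC units:
printf '1536000\t./src\n512000\t./tests\n' |
  numfmt --to=iec-i --suffix=B --delimiter=$'\t'
# 1.5MiB	./src
# 500KiB	./tests
```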
Newlines in the dir names are rendered as `\n`, and backslashes (at least those decoded as such in the current locale²) as `\\`.
If a file is found in more than one directory, it is counted against the first one it is found in (order is not deterministic).
It assumes there's no `POSIXLY_CORRECT` variable in the environment (if there is, setting `PROCINFO["sorted_in"]` has no effect in `gawk`, so the list would not be sorted). If you can't guarantee it³, you can always start `gawk` as `env -u POSIXLY_CORRECT gawk ...` (assuming GNU `env` or compatible; or `(unset -v POSIXLY_CORRECT; gawk ...)`).
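A quick, self-contained way to see the effect (assuming GNU `gawk` and `env`):

```sh
# With POSIXLY_CORRECT in the environment, gawk ignores PROCINFO["sorted_in"]
# and the loop below comes out in arbitrary order; unsetting it restores the
# requested by-value descending order:
env -u POSIXLY_CORRECT gawk 'BEGIN {
  PROCINFO["sorted_in"] = "@val_num_desc"
  size["./a"] = 3; size["./b"] = 1; size["./c"] = 2
  for (d in size) print size[d], d    # prints 3 ./a, 2 ./c, 1 ./b
}'
```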
A few other problems with your approach:
- without `LC_ALL=C`, GNU `find` wouldn't report the files whose names don't form valid characters in the locale, so you could miss some files.
- Embedding `{}` in the code of `sh` constitutes an arbitrary code injection vulnerability. Think for instance of a file called `$(reboot).py`. You should never do that; the paths to the files should be passed as extra arguments and referenced within the code using positional parameters (a safer pattern is sketched after this list).
- `echo` can't be used to display arbitrary data (especially with `-e`, which doesn't make sense here). Use `printf` instead.
- With `xargs -r0 du -sch`, `du` may be invoked several times if the list of files is big, and in that case the last line will only include the total for the last run (see the `--files0-from` sketch below).
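Illustrating those last three points, a hedged sketch of a safer variant (assuming GNU `find` and `du`; the loop is only there to show the positional-parameter pattern):

```sh
# File paths are passed to sh as extra arguments, never embedded in the code,
# and printf (not echo) is what displays them:
find . -name '*.py' -type f -exec sh -c '
  for file do
    printf "%s\n" "$file"
  done' sh {} +

# A single du invocation, so the grand total from -c covers every file
# even when the list is long (GNU du):
find . -name '*.py' -type f -print0 | du -sch --files0-from=-
```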
¹ `%b` reports disk usage in number of 512-byte units. 512 bytes is the minimum granularity for disk allocation, as that's the size of a traditional sector. There's also `%k`, which is `int(%b / 2)`, but that would give incorrect results on filesystems that have 512-byte blocks (file system blocks are generally a power of 2 and at least 512 bytes large).
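The difference between apparent size and disk usage is easy to observe with a sparse file (a sketch assuming GNU `truncate` and `stat`; the `sparse.bin` name is arbitrary):

```sh
truncate -s 1M sparse.bin   # 1MiB of apparent size, no data written
stat -c '%s bytes apparent, %b 512-byte blocks on disk' sparse.bin
# typically prints: 1048576 bytes apparent, 0 512-byte blocks on disk
rm -f sparse.bin
```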
² Using `LC_ALL=C` for `gawk` as well would make it a bit more efficient, but would possibly mangle the output in locales using BIG5 or GB18030 charsets (and the file names are also encoded in that charset), as the encoding of backslash is also found in the encoding of some other characters there.
³ Beware that if your `sh` is `bash`, `POSIXLY_CORRECT` is set to `y` in `sh` scripts, and it is exported to the environment if `sh` is started with `-a` or `-o allexport`, so that variable can also creep in unintentionally.
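A quick check of whether your `sh` behaves that way (a sketch; the output depends on which shell provides `sh`):

```sh
# When sh is bash, posix mode sets POSIXLY_CORRECT inside the script:
sh -c 'printf "%s\n" "${POSIXLY_CORRECT-unset}"'
# prints "y" if sh is bash, "unset" with most other shells

# With allexport it is also exported, so it leaks into child processes:
sh -o allexport -c 'env | grep "^POSIXLY_CORRECT="'
```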
Best Answer

Combining `find` and `awk` allows the averages to be calculated too. This expects null-separated input records (I stole this from muru's answer); for each input record, it adds the file's size to its directory's running total and tracks the largest file seen so far in that directory.

Once all that's done, the script loops over the keys in `SUMSIZES` and outputs the directory, average size, largest file's name and size. You can pipe the output into `sort` to sort by directory name. If you want to additionally format the sizes in human-friendly form, you can change the `printf` line to put the two sizes first and then pipe the output into `numfmt --field=1,2 --to=iec`. You can still sort the result by directory name; you just need to sort starting with the third field: `sort -k3`.
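A minimal sketch of what such a combination might look like (assuming GNU `find` and `gawk`; only the `SUMSIZES` name comes from the description above, the other names are hypothetical):

```sh
find . -type f -printf '%s\0%h\0%f\0' | gawk -v 'RS=\0' '
  {
    size = $0       # apparent size in bytes (%s)
    getline dir     # containing directory (%h)
    getline name    # file name (%f)
    SUMSIZES[dir] += size
    count[dir]++
    if (size >= maxsize[dir]) {   # remember the largest file per directory
      maxsize[dir] = size
      maxname[dir] = name
    }
  }
  END {
    for (dir in SUMSIZES)
      printf "%s\t%d\t%s\t%d\n",
             dir, SUMSIZES[dir] / count[dir], maxname[dir], maxsize[dir]
  }' | sort
```

To feed `numfmt --field=1,2 --to=iec` as described, move the two size columns to the front of the `printf` and switch the final `sort` to `sort -k3`.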