This is my working code, but I believe it's not optimized – there must be a way to complete the job much faster than this:
find . -type f -iname '*.py' -printf '%h\0' |
sort -z -u |
xargs -r -0 -I{} sh -c '
find "{}" -maxdepth 1 -type f -iname "*.py" -print0 |
xargs -r -0 du -sch |
tail -1 |
cut -f1 |
tr "\n" " "
echo -e "{}"' |
sort -k1 -hr |
head -50
The goal is to recursively find all directories that contain *.py files, print the total size of all *.py files next to the name of each directory, sort the results in reverse order by size, and show only the first 50.
Any ideas how to improve this code (performance-wise) while keeping the same output?
EDIT:
I tested your proposals on the following sample: 5805 files, 47 GB in total.
Unfortunately, I couldn't compare them toe-to-toe, since not all proposals follow the same guidelines: the total size should be the disk usage, and the delimiter should be a single space only. Formatting should be as follows: numfmt --to=iec-i --suffix=B
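For reference, numfmt (GNU coreutils) converts a raw byte count into that human-readable form:

```shell
# Convert a raw byte count to IEC binary units (powers of 1024) with a "B" suffix.
printf '%s\n' 1048576 | numfmt --to=iec-i --suffix=B
# prints: 1.0MiB
```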
The following 4 produce sorted output, but David displays the cumulative size of the files, not the real disk usage. Still, his improvement is significant: more than 9.5x faster. Stéphane's and Isaac's versions are winners by a very narrow margin, since their code is approximately 32x faster than the reference code.
$ time madjoe.sh
real 0m2,752s
user 0m3,022s
sys 0m0,785s
$ time david.sh
real 0m0,289s
user 0m0,206s
sys 0m0,131s
$ time isaac.sh
real 0m0,087s
user 0m0,032s
sys 0m0,032s
$ time stephane.sh
real 0m0,086s
user 0m0,013s
sys 0m0,047s
The following code unfortunately neither sorts nor displays the largest 50 results (besides, in the earlier comparison with Isaac's code, it was approximately 6x slower than Isaac's improvement):
$ time hauke.sh
real 0m0,567s
user 0m0,609s
sys 0m0,122s
Best Answer
To count the disk usage as opposed to the sum of the apparent sizes, you'd need to use %b¹ instead of %s, and to make sure each file is counted only once.
Newlines in the dir names are rendered as \n, and backslashes (at least those decoded as such in the current locale²) as \\. If a file is found in more than one directory, it is counted against the first one it is found in (the order is not deterministic).
It assumes there's no POSIXLY_CORRECT variable in the environment (if there is, setting PROCINFO["sorted_in"] has no effect in gawk, so the list would not be sorted). If you can't guarantee it³, you can always start gawk as env -u POSIXLY_CORRECT gawk ... (assuming GNU env or compatible), or as (unset -v POSIXLY_CORRECT; gawk ...).
A few other problems with your approach:
- Without LC_ALL=C, GNU find wouldn't report the files whose names don't form valid characters in the locale, so you could miss some files.
- Embedding {} in the code passed to sh constitutes an arbitrary code injection vulnerability. Think for instance of a file called $(reboot).py. You should never do that: the paths to the files should be passed as extra arguments and referenced within the code using positional parameters.
- echo can't be used to display arbitrary data (especially with -e, which doesn't make sense here). Use printf instead.
- With xargs -r0 du -sch, du may be invoked several times if the list of files is big, and in that case the last line will only include the total for the last run.
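To illustrate the positional-parameter point, here is a generic pattern (my example, not the answer's code): the paths go after the inline script as arguments, never inside the script text itself.

```shell
# Unsafe: {} is substituted into the script text, so a file named
# '$(reboot).py' would have its name evaluated as shell code:
#   find . -iname '*.py' -exec sh -c 'echo "{}"' \;

# Safe: paths are passed as extra arguments (the leading "sh" becomes $0)
# and expanded as positional parameters; file names are never parsed as code.
find . -iname '*.py' -type f -exec sh -c '
  for f do
    printf "%s\n" "$f"    # printf, not echo, for arbitrary data
  done' sh {} +
```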
¹ %b reports disk usage in number of 512-byte units. 512 bytes is the minimum granularity for disk allocation, as that's the size of a traditional sector. There's also %k, which is int(%b / 2), but that would give incorrect results on filesystems that have 512-byte blocks (file system blocks are generally a power of 2 and at least 512 bytes large).
² Using LC_ALL=C for gawk as well would make it a bit more efficient, but would possibly mangle the output in locales using BIG5 or GB18030 charsets (when the file names are also encoded in that charset), as the encoding of backslash is also found in the encoding of some other characters there.
³ Beware that if your sh is bash, POSIXLY_CORRECT is set to y in sh scripts, and it is exported to the environment if sh is started with -a or -o allexport, so that variable can also creep in unintentionally.