Output garbled when running “xargs ls” in parallel

find ls sort xargs

I want to list all files in /usr/ using ls. I am not calling ls directly, but via xargs. Moreover, I am using xargs parameters -L and -P to utilize all my cores.

find /usr/ -type f  -print0 | xargs -0 -L16 -P4 ls -lAd  | sort -k9 > /tmp/aaa

The above command works as expected and produces clean output. However, when I increase the -L parameter (the number of lines per invocation) from 16 to 64:

find /usr/ -type f  -print0 | xargs -0 -L64 -P4 ls -lAd  | sort -k9 > /tmp/bbb

The resulting output is garbled. That is, entries no longer start on a new line; new entries begin in the middle of the previous line and everything is mixed up:

-rw-r--r-- 1 root root  5455 Nov 16  2010 /usr/shareonts/X11/encodings/armscii-8.enc.gz
-rw-r--r-- 1 root root  1285 May 29  2016-rw-r--r-- 1 root root   6205 May 29  2016 /usr/include/arpa/nameser_compat.h
-rw-r--r-- 1 root root       0 Apr 17  20-rw-r--r-- 1 root root   933 Apr 16  2012 /usr/share/icons/nuoveXT2/16x16/actions/address-book-new.png
-rw-r--r-- 1 root root  53651 Jun 17  2012-rw-r--r-- 1 root root  7117 May 29  2016 /usr/include/dlfcn.h
-rw-r--r-- 1 root root  311 Jun  9  2015-rw-r--r-- 1 root root 1700 Jun  9  2015 /usr/share/cups/templates/de/add-printer.tmpl
-rw-r--r-- 1 root root  5157 M1 root root 10620 Jun 14  2012 /usr/lib/perl5/Tk/pTk/tkIntXlibDecls.m
-rw-r--r-- 1 root -rwxr-xr-x 1 root root    1829 Jan 22  2013 /usr/lib/emacsen-common/packages/install/dictionaries-common
-rw-r--r-- 1 root r-rw-r--r-- 1 root root  1890 Jun  2  2012 /usr/share/perl5/Date/Manip/TZ/afaddi00.pm
-rw-r--r-- 1 root root 1104 Jul-rw-r--r-- 1 root root  10268 Jul 27 15:58 /usr/share/perl/5.14.2/B/Debug.pm
-rw-r--r-- 1 root root  725 Apr  1-rw-r--r-- 1 root root  883 Apr  1  2012 /usr/share/icons/gnome/16x16/actions/address-book-new.png

Funny thing is, it only happens when using -L64 or larger. I don't see this problem with -L16.

Can anybody explain what is happening here?

Best Answer

This has to do with writes to pipes. With -L16 you are running one process for every 16 files, and each process produces about a thousand characters of output, depending on how long the filenames are. With -L64 each process produces about four thousand. The ls program almost certainly uses the stdio library, which almost certainly uses a 4kB buffer for output to reduce the number of write calls.
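To check these size estimates on a given system, a rough sketch like the following prints how many bytes each batch of ls output contains. The helper name batch_sizes is hypothetical, and -n is used here in place of -L to cap the number of file names per ls invocation.

```shell
# batch_sizes N DIR (hypothetical helper): print how many bytes each
# `ls -lAd` batch of up to N files under DIR writes to its stdout.
batch_sizes() {
    find "$2" -type f -print0 |
        xargs -0 -n "$1" sh -c 'ls -lAd "$@" | wc -c' sh
}
# Example: batch_sizes 64 /usr/ | sort -n | tail -1   # size of the largest batch
```

Batches that come out larger than about 4096 bytes are the ones that would be written to the pipe in more than one piece.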

So find produces a load of filenames, then (for the -L64 case) xargs chops them into bundles of 64 and starts up 4 ls processes to handle them. Each ls will generate its first 4kB of output and write it to the pipe to sort. Note that this 4kB will typically not end with a newline. So say the third ls gets its first 4kB ready first, and it ends with

 lrwxrwxrwx 1 root root       6 Oct 21  2013 bzegrep -> bzgrep
 -rwxr-xr-x 1 root root    4877 Oct 21  2013 bzexe
 lrwxrwxrwx 1 root root       6 Oct 2

and then the first ls outputs something, e.g.

 total 123459

then the input to sort will include lrwxrwxrwx 1 root root 6 Oct 2total 123459

In the -L16 case, the ls processes will (usually) only output a complete set of results in one go.
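If the parallel ls pipeline has to stay, one possible workaround is to make every ls flush after each line, so that each write to the pipe is one complete line. The sketch below assumes GNU coreutils (for stdbuf), that ls buffers its output through stdio as described above, and that each output line is shorter than PIPE_BUF, so a single write is not split. The function name list_parallel is hypothetical.

```shell
# list_parallel DIR (hypothetical helper): same pipeline as in the question,
# but stdbuf -oL switches ls to line-buffered output, so every write to the
# pipe should be exactly one whole line instead of a 4kB chunk.
list_parallel() {
    find "$1" -type f -print0 |
        xargs -0 -L64 -P4 stdbuf -oL ls -lAd |
        sort -k9
}
# Example: list_parallel /usr/ > /tmp/bbb
```

This trades one write call per 4kB for one write call per line, so it is slower than the buffered version.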

Of course, in this case you are just wasting time and resources by using xargs and ls. You should let find output the information it already has, rather than running extra programs to discover it again.
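As a minimal sketch of that suggestion, find's own -ls action prints an ls-style line for each file from a single process, so there are no concurrent writers to interleave. The helper name list_sorted is hypothetical, and the sort key assumes the usual -ls layout, in which the path is the 11th field (owner and group names without spaces).

```shell
# list_sorted DIR (hypothetical helper): let find print the listing itself.
# Fields of -ls output: inode, blocks, mode, links, owner, group, size,
# month, day, time-or-year, path. The path is therefore field 11.
list_sorted() {
    find "$1" -type f -ls | sort -k11
}
# Example: list_sorted /usr/ > /tmp/ccc
```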
