Output garbled when running “xargs ls” in parallel

find ls sort xargs

I want to list all files in /usr/ using ls. I am not calling ls directly, but via xargs. Moreover, I am using xargs parameters -L and -P to utilize all my cores.

find /usr/ -type f  -print0 | xargs -0 -L16 -P4 ls -lAd  | sort -k9 > /tmp/aaa

The above command works as expected and produces clean output. However, when I increase the -L parameter (the number of lines per command) from 16 to 64:

find /usr/ -type f  -print0 | xargs -0 -L64 -P4 ls -lAd  | sort -k9 > /tmp/bbb

The resulting output is garbled. That is, entries no longer start on a new line; new entries begin in the middle of the previous line and everything is mixed up:

-rw-r--r-- 1 root root  5455 Nov 16  2010 /usr/shareonts/X11/encodings/armscii-8.enc.gz
-rw-r--r-- 1 root root  1285 May 29  2016-rw-r--r-- 1 root root   6205 May 29  2016 /usr/include/arpa/nameser_compat.h
-rw-r--r-- 1 root root       0 Apr 17  20-rw-r--r-- 1 root root   933 Apr 16  2012 /usr/share/icons/nuoveXT2/16x16/actions/address-book-new.png
-rw-r--r-- 1 root root  53651 Jun 17  2012-rw-r--r-- 1 root root  7117 May 29  2016 /usr/include/dlfcn.h
-rw-r--r-- 1 root root  311 Jun  9  2015-rw-r--r-- 1 root root 1700 Jun  9  2015 /usr/share/cups/templates/de/add-printer.tmpl
-rw-r--r-- 1 root root  5157 M1 root root 10620 Jun 14  2012 /usr/lib/perl5/Tk/pTk/tkIntXlibDecls.m
-rw-r--r-- 1 root -rwxr-xr-x 1 root root    1829 Jan 22  2013 /usr/lib/emacsen-common/packages/install/dictionaries-common
-rw-r--r-- 1 root r-rw-r--r-- 1 root root  1890 Jun  2  2012 /usr/share/perl5/Date/Manip/TZ/afaddi00.pm
-rw-r--r-- 1 root root 1104 Jul-rw-r--r-- 1 root root  10268 Jul 27 15:58 /usr/share/perl/5.14.2/B/Debug.pm
-rw-r--r-- 1 root root  725 Apr  1-rw-r--r-- 1 root root  883 Apr  1  2012 /usr/share/icons/gnome/16x16/actions/address-book-new.png

Funny thing is, it only happens when using -L64 or larger. I don't see this problem with -L16.

Can anybody explain what is happening here?

Best Answer

This has to do with writes to pipes. With -L16 you are running one process for each 16 files, which produces about a thousand characters of output, depending on how long the filenames are. With -L64 you get about four thousand. The ls program almost certainly uses the stdio library, and almost certainly uses a 4kB output buffer to reduce the number of write calls.

So find produces a load of filenames, then (for the -L64 case) xargs chops them into bundles of 64 and starts up 4 ls processes to handle them. Each ls will generate its first 4kB of output and write it to the pipe to sort. Note that this 4kB will typically not end with a newline. So say the third ls gets its first 4kB ready first, and it ends with:

 lrwxrwxrwx 1 root root       6 Oct 21  2013 bzegrep -> bzgrep
 -rwxr-xr-x 1 root root    4877 Oct 21  2013 bzexe
 lrwxrwxrwx 1 root root       6 Oct 2

and then the first ls outputs something, e.g.

 total 123459

then the input to sort will include lrwxrwxrwx 1 root root 6 Oct 2total 123459

In the -L16 case, the ls processes will (usually) only output a complete set of results in one go.
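One way to keep the parallelism while avoiding the interleaving described above is to stop the ls processes from sharing a pipe. The following is only a minimal sketch, not a definitive fix: list_parallel, TMPOUT and the part.* file names are made up for illustration, -n 64 stands in for -L64, and it assumes an xargs that supports -0 and -P and a mktemp that accepts a template path.

```shell
# Hypothetical helper: list all regular files under a directory in parallel,
# giving every ls its own output file so no two processes write to one pipe.
list_parallel() {
    dir=$1
    tmpdir=$(mktemp -d) || return 1
    # Each batch of up to 64 names is handled by one sh, which redirects the
    # ls output to a freshly created file inside $tmpdir.
    find "$dir" -type f -print0 |
        TMPOUT="$tmpdir" xargs -0 -n 64 -P 4 sh -c \
            'ls -lAd "$@" > "$(mktemp "$TMPOUT/part.XXXXXX")"' sh
    # xargs returns only after all workers have finished, so every part
    # file is complete by the time it is merged and sorted.
    cat "$tmpdir"/part.* | sort -k9
    rm -rf "$tmpdir"
}
```

Called as list_parallel /usr/ > /tmp/bbb, this trades some disk I/O for output in which every line is written whole.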

Of course, in this case you are just wasting time and resources by using xargs and ls. You should let find output the information it already has rather than running extra programs to discover the information again.
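As a rough sketch of what that could look like (the function names list_serial and list_gnu are hypothetical, and the second variant assumes GNU find, since -printf is not in POSIX):

```shell
# Variant 1: still uses ls, but find batches the names itself and runs the
# ls processes one after another, so their output cannot interleave.
list_serial() {
    find "$1" -type f -exec ls -lAd {} + | sort -k9
}

# Variant 2 (assumes GNU find): no ls at all. find prints the mode, link
# count, owner, group, size and path it already has; the path is field 6.
list_gnu() {
    find "$1" -type f -printf '%M %n %u %g %s %p\n' | sort -k6
}
```

For example, list_serial /usr/ > /tmp/ccc gives a listing in the same format as the original command, without running any ls processes concurrently.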

Examples that "work"

These code snippets will produce the desired output: the lines of k.txt joined with commas.

The paste command:

$ paste -s -d ',' k.txt 
1,2,3

The sed command:

$ sed ':a;N;$!ba;s/\n/,/g' k.txt
1,2,3

$ sed ':a;{N;s/\n/,/};ba' k.txt 
1,2,3

The perl command:

$ perl -00 -p -e 's/\n(?!$)/,/g' k.txt
1,2,3

$ perl -00 -p -e 'chomp;tr/\n/,/' k.txt
1,2,3

The awk command:

$ awk '{printf"%s%s",c,$0;c=","}' k.txt
1,2,3

$ awk '{printf "%s,",$0}' k.txt | awk '{sub(/\,$/,"");print}'
1,2,3

$ awk -vORS=, 1 k.txt | awk '{sub(/\,$/,"");print}'
1,2,3

$ awk 'BEGIN {RS="dn"}{gsub("\n",",");print $0}' k.txt | awk '{sub(/\,$/,"");print}'
1,2,3

The python command:

$ python -c "import sys; print(sys.stdin.read().replace('\n', ',')[0:-1])" <k.txt
1,2,3

$ python -c "import sys; print(sys.stdin.read().replace('\n', ',').rstrip(','))" <k.txt
1,2,3

Bash's mapfile built-in:

$ mapfile -t a < k.txt; (IFS=','; echo "${a[*]}")
1,2,3

The ruby command:

$ ruby -00 -pe 'gsub /\n/,",";chop' < k.txt
1,2,3

$ ruby -00 -pe '$_.chomp!"\n";$_.tr!"\n",","' k.txt
1,2,3

The php command:

$ php -r 'echo strtr(chop(file_get_contents($argv[1])),"\n",",");' k.txt
1,2,3

Caveats

Most of the examples above will work just fine, but some have hidden issues, such as the PHP example. The function chop() is actually an alias of rtrim(), so the last line's trailing whitespace will also be removed.

The same goes for the first Ruby example and the first Python example. The issue is that they all rely on an operation that blindly chops off a trailing character. This is fine for the example that the OP provided, but care must be taken when using these kinds of one-liners to make sure they suit the data being processed.

Example

Say our sample file, k.txt looked like this instead:

$ echo -en "1\n2\n3" > k.txt

It looks similar but it has one slight difference. It doesn't have a trailing newline (\n) like the original file. Now when we run the first Python example we get this:

$ python -c "import sys; print(sys.stdin.read().replace('\n', ',')[0:-1])" <k.txt
1,2
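One way to sidestep this class of problem is to emit the separator before each line instead of chopping a character off afterwards. A minimal sketch, assuming a POSIX awk (join_lines is a hypothetical helper name):

```shell
# Hypothetical helper: join the lines of a file with commas.
join_lines() {
    # sep is empty for the first line and "," for every later one, so nothing
    # ever has to be removed from the end of the output.
    awk '{ printf "%s%s", sep, $0; sep = "," } END { print "" }' "$1"
}

tmp=$(mktemp -d)
printf '1\n2\n3\n' > "$tmp/with_newline.txt"     # final newline present
printf '1\n2\n3'   > "$tmp/without_newline.txt"  # final newline missing
join_lines "$tmp/with_newline.txt"
join_lines "$tmp/without_newline.txt"
rm -rf "$tmp"
```

Both calls print 1,2,3, because awk treats an unterminated last line as a normal record.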

Examples that "almost" work

These are the "always a bridesmaid, never a bride" examples. Most of them could probably be adapted, but when a potential solution to a problem feels "forced", it's probably the wrong tool for the job!

The perl command:

$ perl -p -e 's/\n/,/' k.txt
1,2,3,

The tr command:

$ tr '\n' ','  < k.txt 
1,2,3,

The cat + echo commands:

$ echo $(cat k.txt)
1 2 3

The ruby command:

$ ruby -pe '$_["\n"]=","' k.txt
1,2,3,

Bash's while + read built-ins:

$ while read line; do echo -n "$line,"; done < k.txt
1,2,3,
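As an illustration of what "adapting" one of these might involve, here is a sketch of the while + read loop reworked so that it leaves no trailing comma. The name join_with_read is made up for this example.

```shell
# Hypothetical helper: join the lines of a file with commas using only
# shell built-ins.
join_with_read() {
    sep=''
    # IFS= and -r keep each line verbatim; the "|| [ -n "$line" ]" part
    # keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s%s' "$sep" "$line"
        sep=','
    done < "$1"
    printf '\n'
}
```

It works, but it is noticeably longer than the paste one-liner, which rather supports the point about picking the right tool.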

Shell – Properly escaping output from pipe in xargs

The POSIX specification does give you an example for that:

ls | sed -e 's/"/"\\""/g' -e 's/.*/"&"/' | xargs -E '' printf '<%s>\n'

(Since filenames are arbitrary sequences of bytes (other than / and NUL) while sed and xargs expect text, you'd also need to set the locale to C, where all non-NUL bytes make valid characters, for that to be reliable. The exception is xargs implementations that have a very low limit on the maximum length of an argument.)

The -E '' is needed for some xargs implementations that, without it, would treat a _ argument as the end of input (for instance, echo a _ b | xargs would output only a).

With GNU xargs, you can use:

ls | xargs -d '\n' printf '<%s>\n'

GNU xargs also has a -0 option that has been copied by a few other implementations, so:

ls | tr '\n' '\0' | xargs -0 printf '<%s>\n'

is slightly more portable.

All of those assume the file names don't contain newline characters. If there may be filenames with newline characters, the output of ls is simply not post-processable. If you get:

a
b

That can be either two files, a and b, or one file called a<newline>b; there's no way to tell.

GNU ls has a --quoting-style=shell-always option which makes its output unambiguous and post-processable, but the quoting is not compatible with the quoting expected by xargs. xargs recognises the "...", \x and '...' forms of quoting. In xargs, both "..." and '...' are strong quotes and can't contain newline characters (only \ can escape a newline for xargs). That's not compatible with sh quoting, where only '...' are strong quotes (and can contain newline characters) and \<newline> is a line continuation (it is removed) rather than an escaped newline.

You can use the shell to parse that output and then output it in a format expected by xargs:

eval "files=($(ls --quoting-style=shell-always))"
[ "${#files[@]}" -eq 0 ] || printf '%s\0' "${files[@]}" |
  xargs -0 printf '<%s>\n'
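If the names only need to reach xargs intact, another option is to skip ls and let the shell expand them. This is a sketch under a few assumptions: show_names is a hypothetical helper, xargs -0 must be available, and like plain ls the * glob skips hidden files.

```shell
# Hypothetical helper: print every name in a directory as <name>, without
# parsing the output of ls.
show_names() (
    cd "$1" || exit 1
    # Each name is passed on as one NUL-terminated item, so embedded newlines
    # survive. In an empty directory the unmatched * is passed literally.
    printf '%s\0' * | xargs -0 printf '<%s>\n'
)
```

Because the body runs in a subshell, the cd does not affect the caller's working directory.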
