Xargs: running command once with all arguments

xargs

My aim is to get a list of files that have been modified in git, then run rspec command passing each file in as an argument.

Currently I have:

$ git status -s | awk '{if ($1 == "M") print $2}' | tr "\\n" "\\0" | \
    xargs -0 -I % rspec -f documentation %

This technically works, but it runs rspec for each modified file, I want it to run:

$ rspec path/to/file/1 path/to/file/2 ...

Anyone know how I can achieve this?

Best Answer

... | xargs -0 rspec -f documentation

Note that this can call rspec multiple times if the command line would be too long. This isn't an issue with rspec since the reason to call it once is performance, but don't use it for something like xargs -0 tar cf archive.tar where any second, third, … run would create an archive overwriting the output of the previous runs.

Related Solutions

Bash – way to make this one-liner faster

There's a part you can easily improve, but it isn't the slowest part.

find /home/mydir/ -type f | sort | \
awk "/xml_20140207_000016.zip/,/xml_20140207_235938.zip/"

This is somewhat wasteful because it first lists all files, then sorts the file names and extracts the interesting ones. The find command has to run to completion before the sorting can begin.

It would be faster to list only the interesting files in the first place, or at least as small a superset as possible. If you need a finer-grained filter on names than find is capable of, pipe into awk, but don't sort: awk and other line-by-line filters can process lines one by one but sort needs the complete input.

find /home/mydir/ -name 'xml_20140207_??????.zip' -type f | \
awk 'match($0, /_[0-9]*.zip$/) &&
     (time = substr($0, RSTART+1, RLENGTH-5)) &&
     time >= 16 && time <= 235938' |
xargs -n 1 -P 10 zipgrep "my search string"

The part which is most obviously suboptimal is zipgrep. Here there is no easy way to improve performance because of the limitations of shell programming. The zipgrep script operates by listing the file names in the archive, and calling grep on each file's content, one by one. This means that the zip archive is parsed again and again for each file. A Java program (or Perl, or Python, or Ruby, etc.) can avoid this by processing the file only once.

If you want to stick to shell programming, you can try mounting each zip instead of using zipgrep.

… | xargs -n1 -P2 sh -c '
    mkdir "mnt$$-$1";
    fuse-zip "$1" "mnt$$-$1";
    grep -R "$0" "mnt$$-$1"
    fusermount -u "mnt$$-$1"
' "my search string"

Note that parallelism isn't going to help you much: the limiting factor on most setups will be disk I/O bandwidth, not CPU time.

I haven't benchmarked anything, but I think the biggest place for improvement would be to use a zipgrep implementation in a more powerful language.

Command that prints file contents given filename on stdin

After doing a bit more research I realised that this scenario is exactly what xargs is designed for:

./my-command args | cut -d : -f 5 | xargs cat

Which will transform the output of stdin into an invocation of cat with an actual filename and thus print out the file contents

Best Answer

Related Solutions

Bash – way to make this one-liner faster

Command that prints file contents given filename on stdin

Related Question