Bash – Running multiple instances of perl via xargs


I have a script, dataProcessing.pl, that accepts a tab-delimited .txt file and performs extensive processing on the contained data. Multiple input files exist (file1.txt, file2.txt, file3.txt), which are currently looped over by a bash script that invokes perl on each iteration (i.e. the input files are processed one at a time).
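For reference, the current sequential setup looks something like this (a minimal sketch; the detail that dataProcessing.pl takes the file path as its first argument is my assumption):

for f in file1.txt file2.txt file3.txt; do
    perl dataProcessing.pl "$f"    # one perl process at a time
done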

I wish, however, to run multiple instances of Perl (if possible) and process all input files simultaneously via xargs. I'm aware that you can run something akin to:

perl -e 'print "Test" x 100' | xargs -P 100

However, I want to pass a different file to each parallel instance of Perl (one instance works on file1.txt, one works on file2.txt, and so forth). A file handle or file path can be passed to Perl as an argument. How can I do this? I am not sure how I would pass the file names to xargs, for example.
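On the Perl side, receiving the path as a command-line argument is straightforward; this one-liner (hypothetical, standing in for the real dataProcessing.pl) shows the idea of reading the file name from @ARGV:

perl -le 'my $file = shift @ARGV; print "would process $file"' file1.txt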

Best Answer

Use xargs with -n 1, meaning "pass only a single argument to each invocation of the utility".

Something like:

printf '%s\n' file*.txt | xargs -n 1 -P 100 perl dataProcessing.pl

which assumes that the filenames don't contain literal newlines.
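If you want to preview the commands that would be run, you can prefix the utility with echo (a dry run: echo prints each constructed command line instead of executing perl):

printf '%s\n' file1.txt file2.txt | xargs -n 1 echo perl dataProcessing.pl

which prints one perl dataProcessing.pl fileN.txt line per input file.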

If you have GNU xargs, or another implementation of xargs that understands -0 (for reading nul-delimited arguments, which allows for filenames containing newlines) and -r (for not running the utility at all when the argument list is empty, as when file*.txt matches nothing and nullglob is in effect), you may do

printf '%s\0' file*.txt | xargs -r0 -n 1 -P 100 perl dataProcessing.pl

Note that both of these variations may start up to 100 parallel instances of the script, which may not be what you want. You may want to limit the count to a reasonable number related to the number of CPUs on your machine (or, if the task is memory-bound, to the total available RAM divided by the expected memory usage per task).
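For example, to cap the number of parallel jobs at the number of online CPUs (nproc is part of GNU coreutils; getconf _NPROCESSORS_ONLN is a more portable alternative):

printf '%s\0' file*.txt | xargs -r0 -n 1 -P "$(nproc)" perl dataProcessing.pl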
