I am using GNU parallel and want to understand how I can get the individual string passed to each parallel command.
As an example, the GNU Parallel documentation shows how to move files from the current directory to another:
ls | parallel mv {} destdir
So is there a way to get or print each individual file name that was passed to parallel?
Case for parallel processing
I need to check multiple sites in parallel and record, for each of them:
- the HTTP return code (2xx, 4xx, 5xx)
- the source URL
- the ultimate destination URL
- the curl exit code
Here is the code which does this:
unset return_code_array
unset destination_url_array
unset exit_code_array
while read -r return_code_var destination_url_var exit_code_var; do
    destination_url_array+=("$destination_url_var")
    exit_code_array+=("$exit_code_var")
    return_code_array+=("$return_code_var")
done < <(printf '%s\n' "${all_valid_URLs_array[@]}" | parallel -j 20 -k 'curl --max-time 20 -sL -o /dev/null -w "%{response_code} %{url_effective} " {}; printf "%s %s\n" "$?" ')
As a result, I have three arrays that hold the HTTP return code, the ultimate destination URL, and the curl exit code for each corresponding entry of all_valid_URLs_array.
At the same time I need to do some processing for each destination_url_var, like checking whether it matches the source URL, but I have no idea how to get the string which was passed to parallel.
Currently, I am running a second loop after the above one for such processing, but I want to know if what I want to accomplish is possible.
Thanks.
Best Answer
In your example,
'curl … {}; printf "%s %s\n" "$?" '
(why the second %s?) is a single-quoted piece of shell code. In it you can use {} more than once.
Alternatively, create a variable and use it as many times as you want. The name of the variable can be descriptive, which is an advantage. There's another advantage: in general, what gets substituted for {} can be a long string, and substituting it many times may bloat the code parallel will pass to particular shells. IMO it's better to substitute once and let the shell save the string and reuse it.
In the case of GNU parallel it's safe to embed {} in the shell code. It's an exception explicitly mentioned in this answer: "Never embed {} in the shell code!". You probably already know this; the remark is for a general audience.
Note that once the shell code also prints the source URL, you need to adjust your read in the main loop: it now has to read into four variables. This way you will transfer the source URL from the inside of parallel to the main loop, where you can compare it to destination_url_var or do whatever you want.
Still, in this approach "whatever you want" is not parallelized.
If you capture the output from curl into separate variables inside the shell code run by parallel (instead of just printing it to be captured outside of parallel), then you will be able to do the comparison (or whatever you want) there, in parallel, and e.g. printf conditionally. It's up to you where you implement the desired logic, as long as the inside of parallel generates output in the form expected by the outside read.
The shell code passed to parallel still needs to be single-quoted. As it grows, you may need to use (embed) single quotes in this very code; then quoting will get somewhat complicated and less readable. In such a situation, consider moving the code to a separate script where you can quote independently. You will invoke it from the main script, passing {} to it as an argument.
Inside separate_script, the string substituted for {} will be available as $1 (don't forget to double-quote it).