GNU Parallel: How to limit max running network jobs per host

Tags: gnu-parallel, limit

I'm using GNU Parallel to scan a list of URLs (from different hosts) for vulnerabilities, like this:

cat urls.txt | parallel --gnu -j 50 ./scan {}

The 'scan' program handles one URL in one thread, and I need to hard-limit the number of simultaneous requests (jobs) to each host, for example to 5 connections. How can I achieve this?

Best Answer

Split urls into one file per host. Then run 'parallel -j5' on each file.
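A minimal sketch of that approach, assuming one URL per line in urls.txt and that the host is the part between '//' and the next '/' (the by-host directory name is just illustrative):

mkdir -p by-host
# write each URL into a file named after its host (with -F/ the host is field 3)
awk -F/ '{ print > ("by-host/" $3 ".txt") }' urls.txt
# scan one host at a time, with at most 5 concurrent scans within that host
for f in by-host/*.txt; do
  parallel -j5 ./scan {} < "$f"
done

The loop processes hosts one after another; the variants below also let several hosts be scanned at the same time.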

Or sort the URLs so that all URLs from the same host are adjacent, insert a '\0' delimiter whenever a new host starts, then split on '\0' (removing the delimiter) and pass each block to a new instance of parallel:

sort urls.txt | 
  perl -pe '(not m://$last:) and print "\0";m://([^/]+): and $last=$1' |
  parallel -j10 --pipe --rrs -N1 --recend '\0' parallel -j5 ./scan
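For example, with a sorted list covering two hosts (hostnames made up, \0 standing for the NUL byte the perl filter inserts), the outer parallel sees:

http://a.example/1
http://a.example/2
\0http://b.example/1

The outer parallel -j10 splits its input on '\0' (removing the delimiter because of --rrs) and starts an inner parallel -j5 for each block, so up to 10 hosts are scanned concurrently with at most 5 simultaneous scans per host.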

Edit:

I think this will work:

cat urls.txt | parallel -q -j50 sem --fg --id '{= m://([^/]+):; $_=$1 =}' -j5 ./scan {}

sem is part of GNU Parallel (it is shorthand for parallel --semaphore). {= m://([^/]+):; $_=$1 =} extracts the hostname from the URL. -j5 tells sem to create a counting semaphore with 5 slots. --fg forces sem to run the job in the foreground instead of spawning it in the background. By using the hostname as the ID you get a separate counting semaphore, and therefore a separate 5-job limit, for each hostname.

-q is needed for parallel if some of your URLs contain special shell characters (such as &); they need to be protected from shell expansion because sem will also shell-expand them.
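A rough illustration of the per-host counting semaphore (example.com and sleep 10 are just placeholders): six jobs queued under the same ID behave like this:

for i in 1 2 3 4 5 6; do
  sem --fg --id example.com -j5 sleep 10 &
done
wait
# the first five 'sleep 10' jobs start at once; the sixth blocks until one of
# the five slots under the 'example.com' semaphore is released

A sem invocation with a different --id (here: a different hostname) maintains its own independent set of 5 slots.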
