Linux – Spawning multiple parallel wgets and storing results in a bash array to be pretty printed when all wgets are done

array, bash, linux, wget, xargs

I have a long list of URLs on my own website, listed in a carriage-return-separated text file. So for instance:

  • http://www.mysite.com/url1.html
  • http://www.mysite.com/url2.html
  • http://www.mysite.com/url3.html

I need to spawn a number of parallel wgets to hit each URL twice, check for and retrieve a particular header, and then save the results in an array that I want to output in a nice report.

I have part of what I want by using the following xargs command:

xargs -x -P 20 -n 1 wget --server-response -q -O - --delete-after < ./urls.txt 2>&1 | grep Caching

The question is how do I run this command twice and store the following:

  1. The URL hit
  2. The 1st result of the grep against the Caching header
  3. The 2nd result of the grep against the Caching header

So the output should look something like:

=====================================================
http://www.mysite.com/url1.html
=====================================================
First Hit: Caching: MISS
Second Hit: Caching: HIT

=====================================================
http://www.mysite.com/url2.html
=====================================================
First Hit: Caching: MISS
Second Hit: Caching: HIT

And so forth.

The order in which the URLs appear isn't necessarily a concern, as long as the headers are associated with the right URL.

Because of the number of URLs, I need to hit them in parallel rather than serially, otherwise it will take far too long.

The trick is how to get multiple parallel wgets going AND store the results in a meaningful way. I'm not married to using an array if there is a more logical way of doing it (maybe writing to a log file?).
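One rough, untested sketch of the log-file idea I had is below: each request writes its own fragment to a temp directory, and everything gets concatenated at the end. The file naming via md5sum is just a placeholder, and it doesn't yet do the two hits per URL:

tmpdir=$(mktemp -d)
while read -r url; do
    out="$tmpdir/$(echo "$url" | md5sum | cut -d' ' -f1)"   # one file per URL
    wget --server-response -q -O - "$url" 2>&1 | grep Caching > "$out" &
done < urls.txt
wait
cat "$tmpdir"/*
rm -r "$tmpdir"

But that launches every request at once, so I lose the -P 20 throttling I get from xargs.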

Do any bash gurus have any suggestions for how I might proceed?

Best Answer

Make a small script that does the right thing given a single url (based on terdon's code):

#!/bin/bash
# usage: my_script <url>
url="$1"

echo "======================================="
echo "$url"
echo "======================================="

echo -n "First Hit: Caching: "
if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi

echo -n "Second Hit: Caching: "
if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi
echo ""
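Assuming you save this as my_script in the current directory, make it executable and sanity-check it against a single URL first:

chmod +x my_script
./my_script http://www.mysite.com/url1.html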

Then run this script in parallel (say, 500 jobs at a time) using GNU Parallel:

cat urls.txt | parallel -j500 ./my_script

GNU Parallel will make sure the output from two jobs is never mixed together - a guarantee that xargs does not give.
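If you want the report to come out in the same order as urls.txt, and capped at the 20-way parallelism you were using with xargs, a variation like this should work (same assumption as above that the script is saved as ./my_script; -k keeps output in input order, -a reads the arguments from a file):

parallel -k -j20 -a urls.txt ./my_script > report.txt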

You can find more about GNU Parallel at: http://www.gnu.org/s/parallel/

You can install GNU Parallel in just 10 seconds with:

wget -O - pi.dk/3 | sh 

Watch the intro video on http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
