Linux – Spawning multiple parallel wgets and storing results in a bash array to be pretty printed when all wgets are done

array, bash, linux, wget, xargs

I have a long list of URLs on my own website, listed in a carriage-return-separated text file. So for instance:

  • http://www.mysite.com/url1.html
  • http://www.mysite.com/url2.html
  • http://www.mysite.com/url3.html

I need to spawn a number of parallel wgets to hit each URL twice, check for and retrieve a particular header, and then save the results in an array that I want to output in a nice report.

I have part of what I want by using the following xargs command:

xargs -x -P 20 -n 1 wget --server-response -q -O - --delete-after < ./urls.txt 2>&1 | grep Caching

The question is how do I run this command twice and store the following:

  1. The URL hit
  2. The 1st result of the grep against the Caching header
  3. The 2nd result of the grep against the Caching header

So the output should look something like:

=====================================================
http://www.mysite.com/url1.html
=====================================================
First Hit: Caching: MISS
Second Hit: Caching: HIT

=====================================================
http://www.mysite.com/url2.html
=====================================================
First Hit: Caching: MISS
Second Hit: Caching: HIT

And so forth.

The order in which the URLs appear isn't necessarily a concern, as long as the headers are associated with the right URL.

Because of the number of URLs, I need to hit them in parallel rather than serially, otherwise it will take far too long.

The trick is how to get multiple parallel wgets going AND store the results in a meaningful way. I'm not married to using an array if there is a more logical way of doing it (maybe writing to a log file?).
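One rough, untested sketch of the log-file idea I had is below: each request writes its own fragment to a temp directory, and everything gets concatenated at the end. The file naming via md5sum is just a placeholder, and it doesn't yet do the two hits per URL:

tmpdir=$(mktemp -d)
while read -r url; do
    out="$tmpdir/$(echo "$url" | md5sum | cut -d' ' -f1)"   # one file per URL
    wget --server-response -q -O - "$url" 2>&1 | grep Caching > "$out" &
done < urls.txt
wait
cat "$tmpdir"/*
rm -r "$tmpdir"

But that launches every request at once, so I lose the -P 20 throttling I get from xargs.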

Do any bash gurus have any suggestions for how I might proceed?

Best Answer

Make a small script that does the right thing given a single url (based on terdon's code):

#!/bin/bash
# usage: my_script <url>
url="$1"

echo "======================================="
echo "$url"
echo "======================================="

echo -n "First Hit: Caching: "
if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi

echo -n "Second Hit: Caching: "
if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi
echo ""
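Assuming you save this as my_script in the current directory, make it executable and sanity-check it against a single URL first:

chmod +x my_script
./my_script http://www.mysite.com/url1.html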

Then run this script in parallel (say, 500 jobs at a time) using GNU Parallel:

cat urls.txt | parallel -j500 ./my_script

GNU Parallel will make sure the output from two jobs is never mixed together - a guarantee that xargs does not give.
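If you want the report to come out in the same order as urls.txt, and capped at the 20-way parallelism you were using with xargs, a variation like this should work (same assumption as above that the script is saved as ./my_script; -k keeps output in input order, -a reads the arguments from a file):

parallel -k -j20 -a urls.txt ./my_script > report.txt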

You can find more about GNU Parallel at: http://www.gnu.org/s/parallel/

You can install GNU Parallel in just 10 seconds with:

wget -O - pi.dk/3 | sh 

Watch the intro video on http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
