I have a long list of URLs on my own website in a newline-separated text file. For instance:
- http://www.mysite.com/url1.html
- http://www.mysite.com/url2.html
- http://www.mysite.com/url3.html
I need to spawn a number of parallel wgets to hit each URL twice, check for and retrieve a particular header, and then save the results in an array which I want to output in a nice report.
I have part of what I want by using the following xargs command:
xargs -x -P 20 -n 1 wget --server-response -q -O - --delete-after < ./urls.txt 2>&1 | grep Caching
The question is: how do I run this command twice per URL and store the following?
- The URL hit
- The 1st result of the grep against the Caching header
- The 2nd result of the grep against the Caching header
So the output should look something like:
=====================================================
http://www.mysite.com/url1.html
=====================================================
First Hit: Caching: MISS
Second Hit: Caching: HIT
=====================================================
http://www.mysite.com/url2.html
=====================================================
First Hit: Caching: MISS
Second Hit: Caching: HIT
And so forth.
The order in which the URLs appear isn't necessarily a concern, as long as the header(s) are associated with the right URL.
Because of the number of URLs, I need to hit them in parallel, not serially; otherwise it will take far too long.
The trick is how to get multiple parallel wgets AND store the results in a meaningful way. I'm not married to using an array if there is a more logical way of doing this (maybe writing to a log file?).
Do any bash gurus have any suggestions for how I might proceed?
Best Answer
Make a small script that does the right thing given a single URL (based on terdon's code):
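The original script isn't shown here, so the following is a sketch of what such a per-URL script might look like (the name check_url.sh and the helper function names are my own; adjust the grep pattern to your actual header):

```shell
#!/bin/bash
# check_url.sh -- hypothetical per-URL script (a reconstruction; terdon's
# original code is not shown). Hits one URL twice and prints a report block.

# Hit the URL once and print the first "Caching" response header, if any.
fetch_caching_header() {
    wget --server-response -q -O /dev/null "$1" 2>&1 \
        | grep -m 1 'Caching' | sed 's/^ *//'
}

# Format one URL's two results in the report layout from the question.
print_report() {
    echo "====================================================="
    echo "$1"
    echo "====================================================="
    echo "First Hit: $2"
    echo "Second Hit: $3"
}

if [ $# -ge 1 ]; then
    url=$1
    first=$(fetch_caching_header "$url")
    second=$(fetch_caching_header "$url")
    print_report "$url" "$first" "$second"
fi
```

Because each invocation prints one complete, self-delimited block, the per-URL output stays meaningful even when many copies run at once.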
Then run this script in parallel (say, 500 jobs at a time) using GNU Parallel:
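Assuming the per-URL script was saved as check_url.sh and made executable, the parallel run might look like this (500 is just the example job count; tune -j to what your server can take):

```shell
# Feed one URL per line to GNU Parallel; run up to 500 jobs at a time.
parallel -j 500 ./check_url.sh < urls.txt > report.txt
```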
GNU Parallel will make sure the output from two processes is never mixed - a guarantee xargs does not give.
You can find more about GNU Parallel at: http://www.gnu.org/s/parallel/
You can install GNU Parallel in just 10 seconds with:
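The installer command was not preserved here; the one the GNU Parallel documentation has long advertised is along these lines (it downloads and runs a script, so inspect pi.dk/3 first if that concerns you; your distribution's package, e.g. apt-get install parallel, also works):

```shell
# Fetch and run the GNU Parallel installer, falling back to curl if
# wget is unavailable.
(wget -O - pi.dk/3 || curl pi.dk/3/) | bash
```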
Watch the intro video on http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1