Wget Directory Options

wget

I have read the Wget manual, but unfortunately it does not seem to address my issue, so I would be most grateful if someone could offer me a bit of assistance.

We have a website, (say) website.com, which links directly to (say) website.com/1/, website.com/2/, … etc.

Now each page website.com/r/, where r is an integer, links to a number of pdf documents. Rather than them being located at website.com/r/doc-i.pdf – which would be convenient – they are all located at website.com/files/doc-i.pdf.

Thus, when I run the command wget -r -l 2 -A pdf website.com, I will of course end up with a big folder named "files", with all the pdf documents contained within it.

I would much prefer, however, that they be organised into different folders named 1, 2, …, n, that correspond to the page from which they were downloaded. Since I will be downloading in total around 10,000 pdf files, I would rather not have to do this manually.

So how do I tell Wget to organise the files, not by the website's directory structure, but by the route it took to reach each file?

I hope my explanation is clear, and that this is not too difficult to achieve.

Best Answer

(untested) The following needs some tuning; it is just a general idea:

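### get level 1: the front page plus the pages it links to
wget -r -l 1 website.com/

### for each HTML file obtained
find website.com -name '*.html' | while IFS= read -r a
do
  ### get level 2, but prefix it with the name of the page's directory
  ### (website.com/1/ is saved as website.com/1/index.html, so basename alone would always give index.html)
  b=$(basename "$(dirname "$a")")
  wget -P "$b" -r -l 1 -A pdf "http://${a%index.html}"
done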
  • the find will probably need some tuning
  • perhaps add something like mv $b/website.com/files FINAL/$b to reduce the directory levels
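
The second point can be sketched as a small shell function. This is only an illustration and is as untested against a real site as the loop above: flatten_page_dir is a hypothetical helper name (not part of Wget), FINAL is just the directory name used in the bullet, and it assumes each page directory contains the PDFs somewhere beneath it with unique file names.

```shell
#!/bin/sh
# Hypothetical helper (not a Wget feature): move every PDF found anywhere
# under <page_dir> into <dest_root>/<name of page_dir>/, dropping the
# intermediate website.com/files/ levels.
flatten_page_dir() {
  page_dir=$1
  dest_root=$2
  dest="$dest_root/$(basename "$page_dir")"
  mkdir -p "$dest"
  # Assumes file names are unique within one page; mv overwrites duplicates.
  find "$page_dir" -type f -name '*.pdf' -exec mv {} "$dest"/ \;
}

# Possible usage once the loop has created 1/, 2/, ... (assumed names):
#   for d in [0-9]*/; do flatten_page_dir "${d%/}" FINAL; done
```

Moving by file type rather than by a fixed path means the function does not depend on the PDFs living under files/ specifically, at the cost of also picking up any PDF stored elsewhere under the page directory.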