Wget: recursively retrieve URLs from a specific website

web-crawler wget

I'm trying to recursively retrieve all possible URLs (internal page URLs) from a website.

Can you please help me out with wget? Or is there a better alternative to achieve this? I do not want to download any content from the website; I just want the URLs of the same domain.

Thanks!

EDIT

I tried doing this with wget and then grepping the log file (urllog.txt) afterwards. Not sure if this is the right way to do it, but it works!

$ wget -R.jpg,.jpeg,.gif,.png,.css -c -r http://www.example.com/ -o urllog.txt
$ grep -e " http" urllog.txt | awk '{print $3}'
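If you want to avoid keeping any downloaded content at all, wget's --spider mode can crawl the same way while discarding what it fetches (it still retrieves the HTML temporarily so it can extract the links). A minimal sketch along the same lines as the commands above; spiderlog.txt and urls.txt are just illustrative filenames, and the awk field number depends on how your wget version formats its log lines:

$ wget --spider -r -l 5 -R.jpg,.jpeg,.gif,.png,.css -o spiderlog.txt http://www.example.com/
$ grep -e '^--' spiderlog.txt | awk '{print $3}' | sort -u > urls.txt

The grep picks out the "--<timestamp>--  <url>" request lines from the log, and sort -u removes the duplicates that recursive crawling produces.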

Best Answer

You could also use something like Nutch. I've only ever used it to crawl internal links on a site and index them into Solr, but according to this post it can also do external links. Depending on what you want to do with the results, it may be a bit overkill, though.
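For reference, this is roughly what the Nutch 1.x crawl cycle from its tutorial looks like. Treat it as a sketch rather than a recipe: command arguments differ between Nutch versions, and urls/, crawl/crawldb, crawl/segments and crawldump are placeholder paths.

# seed the crawl db from a directory of seed files (one URL per line)
$ bin/nutch inject crawl/crawldb urls

# one generate / fetch / parse / updatedb round; repeat to crawl deeper
$ bin/nutch generate crawl/crawldb crawl/segments
$ segment=$(ls -d crawl/segments/* | tail -1)
$ bin/nutch fetch "$segment"
$ bin/nutch parse "$segment"
$ bin/nutch updatedb crawl/crawldb "$segment"

# dump every URL discovered so far as plain text
$ bin/nutch readdb crawl/crawldb -dump crawldump

The readdb dump is what you would then grep for URLs, instead of a wget log.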
