Wget: recursively retrieve URLs from a specific website

web-crawler wget

I'm trying to recursively retrieve all possible URLs (internal page URLs) from a website.

Can you please help me out with wget? Or is there a better alternative to achieve this? I do not want to download any content from the website; I just want the URLs of the same domain.

Thanks!

EDIT

I tried doing this with wget and then grepping the log file (urllog.txt) afterwards. Not sure if this is the right way to do it, but it works!

$ wget -R.jpg,.jpeg,.gif,.png,.css -c -r http://www.example.com/ -o urllog.txt
$ grep -e " http" urllog.txt | awk '{print $3}'
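If you want to avoid keeping any downloaded content at all, wget's --spider mode can crawl the same way while discarding what it fetches (it still retrieves the HTML temporarily so it can extract the links). A minimal sketch along the same lines as the commands above; spiderlog.txt and urls.txt are just illustrative filenames, and the awk field number depends on how your wget version formats its log lines:

$ wget --spider -r -l 5 -R.jpg,.jpeg,.gif,.png,.css -o spiderlog.txt http://www.example.com/
$ grep -e '^--' spiderlog.txt | awk '{print $3}' | sort -u > urls.txt

The grep picks out the "--<timestamp>--  <url>" request lines from the log, and sort -u removes the duplicates that recursive crawling produces.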

Best Answer

You could also use something like Nutch. I've only ever used it to crawl internal links on a site and index them into Solr, but according to this post it can also do external links. Depending on what you want to do with the results, it may be a bit overkill, though.
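For reference, this is roughly what the Nutch 1.x crawl cycle from its tutorial looks like. Treat it as a sketch rather than a recipe: command arguments differ between Nutch versions, and urls/, crawl/crawldb, crawl/segments and crawldump are placeholder paths.

# seed the crawl db from a directory of seed files (one URL per line)
$ bin/nutch inject crawl/crawldb urls

# one generate / fetch / parse / updatedb round; repeat to crawl deeper
$ bin/nutch generate crawl/crawldb crawl/segments
$ segment=$(ls -d crawl/segments/* | tail -1)
$ bin/nutch fetch "$segment"
$ bin/nutch parse "$segment"
$ bin/nutch updatedb crawl/crawldb "$segment"

# dump every URL discovered so far as plain text
$ bin/nutch readdb crawl/crawldb -dump crawldump

The readdb dump is what you would then grep for URLs, instead of a wget log.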
