Hi, I'm trying to determine all the valid URLs under a given domain without having to mirror the site locally.
People generally want to download all the pages, but I just want a list of the direct URLs under a given domain (e.g. www.example.com), which would be something like www.example.com/page1, www.example.com/page2, etc.
Is there a way to use wget to do this? Or is there a better tool?
Best Answer
Here is a crude script:
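A minimal sketch along those lines, with www.example.com standing in for the target page (the exact `grep`/`sed` patterns are just one way to do it):

```
curl -s http://www.example.com |
  grep -o 'href="[^"]*"' |   # pick out every href="..." attribute
  sed 's/href="//;s/"$//' |  # keep only the URL between the quotes
  sort -u                    # sort and drop duplicate links
```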
The `grep` picks out all the `href`s. The `sed` picks out the URL part from each `href`. The `sort` filters out duplicate links.

It will also work with `wget -O -` in place of `curl -s`.
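For instance, the same pipeline with wget (the `-q` flag is an addition here, to keep wget's progress output off the pipe):

```
wget -qO- http://www.example.com |
  grep -o 'href="[^"]*"' |
  sed 's/href="//;s/"$//' |
  sort -u
```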
Example output: