I'm writing a bash script that finds the links on a page. It downloads the page's source code, extracts the URLs from the <a> tags, and stores them in an array, $links. The problem, though, is that some links will not include the entire URL, just the sub-page or file. Say the source code of a page looks like this:
<h1>Julian's crappy webpage</h1>
<a href="http://one.com">Page One</a>
<a href="https://two.com">Page Two</a>
<a href="/three">Page Three</a>
<a href="four">Page Four</a>
After finding links, the array would look something like this:
http://one.com
https://two.com
/three
four
I also have a variable $URL which is a valid URL, e.g. "https://google.com"
I need a way to make all of the items in $links valid by doing different things based on the item. For every item in $links:
- If the item starts with a /, then combine: $URL + ${links[n]}
- If the item does not start with a "/" or "https://" or "http://", then combine: $URL + "/" + ${links[n]}
Expected result for the example above:
http://one.com
https://two.com
https://google.com/three
https://google.com/four
Best Answer
Here is a sample script which you can use to implement this:
It assumes source.html is the downloaded source of the web page in the current directory.
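A minimal sketch of that script is below. It writes the example page from the question to source.html for demonstration; in practice that file would come from a download step such as curl -o source.html "$URL". The href extraction uses a simple grep/sed pipeline, which assumes each anchor is written as <a href="..."> on one line as in the example.

```shell
#!/bin/bash

URL="https://google.com"

# For demonstration only: write the example page from the question to
# source.html. In a real run this file is the downloaded page source.
cat > source.html <<'EOF'
<h1>Julian's crappy webpage</h1>
<a href="http://one.com">Page One</a>
<a href="https://two.com">Page Two</a>
<a href="/three">Page Three</a>
<a href="four">Page Four</a>
EOF

# Print the absolute form of a single href value, following the rules above.
resolve_link() {
    case "$1" in
        http://*|https://*) printf '%s\n' "$1" ;;      # already a full URL
        /*)                 printf '%s\n' "$URL$1" ;;  # root-relative: $URL + item
        *)                  printf '%s\n' "$URL/$1" ;; # page-relative: $URL + "/" + item
    esac
}

# Extract each href value from the page's anchor tags and resolve it.
links=()
while IFS= read -r href; do
    links+=("$(resolve_link "$href")")
done < <(grep -o '<a href="[^"]*"' source.html | sed 's/^<a href="//;s/"$//')

printf '%s\n' "${links[@]}"
```

Running it prints the four expected URLs. The case statement keeps the logic readable and is easy to extend, e.g. for protocol-relative links (//example.com) if the pages you scrape use them.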