Shell – Fix array of incomplete links

shell-script

I'm writing a bash script that finds links in a page: it downloads the page's source code and extracts the href values of the <a> tags into an array, $links. The problem is that some links are not complete URLs, just a sub-page or file. Say the source code of a page looks like this:

<h1>Julian's crappy webpage</h1>
<a href="http://one.com">Page One</a>
<a href="https://two.com">Page Two</a>
<a href="/three">Page Three</a>
<a href="four">Page Four</a>

After finding links, the array would look something like this:

http://one.com
https://two.com
/three
four

I also have a variable $URL which is a valid URL, e.g. "https://google.com"

I need a way to make all of the items in $links valid by doing different things based on the item. For every item in $links:

  • If the item starts with a /, then combine: $URL + ${links[n]}
  • If the item does not start with a "/" or "https://" or "http://", then combine: $URL + "/" + ${links[n]}

Expected result for the example above:

http://one.com
https://two.com
https://google.com/three
https://google.com/four
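
The two rules above can be sketched as a small bash function (a sketch only; `normalize_link` is an illustrative name, not something from my script):

```shell
#!/bin/bash
URL="https://google.com"

# Print the absolute form of one link according to the two rules.
normalize_link() {
  case $1 in
    http://*|https://*) printf '%s\n' "$1" ;;      # already absolute
    /*)                 printf '%s\n' "$URL$1" ;;  # root-relative: prepend $URL
    *)                  printf '%s\n' "$URL/$1" ;; # page-relative: prepend $URL/
  esac
}
```

For example, `normalize_link /three` should print `https://google.com/three`.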

Best Answer

Here is a sample script which you can use to implement this:

#!/bin/bash

readonly URL="https://google.com"
# Read the href values into an array. [^"]* stops at the closing quote even
# when several links share one line (a greedy .* would swallow them all), and
# mapfile avoids word-splitting on links that contain whitespace.
mapfile -t links < <(grep -oP '(?<=href=")[^"]*' source.html)

for ((i=0; i<${#links[@]}; i++))
do
  case ${links[$i]} in
    http://*|https://*)
      : # already absolute, do nothing
      ;;
    /*)
      links[$i]="$URL${links[$i]}" # root-relative: prepend $URL
      ;;
    *)
      links[$i]="$URL/${links[$i]}" # page-relative: prepend $URL/
      ;;
  esac
done

It assumes source.html is the downloaded source of the web page in the current directory.
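
To see it end to end without downloading anything, here is a self-contained run using the question's sample markup in place of source.html (the loop body is the same as above):

```shell
#!/bin/bash
URL="https://google.com"

# Stand-in for source.html, taken from the question's example page.
html='<h1>Julian'"'"'s crappy webpage</h1>
<a href="http://one.com">Page One</a>
<a href="https://two.com">Page Two</a>
<a href="/three">Page Three</a>
<a href="four">Page Four</a>'

mapfile -t links < <(grep -oP '(?<=href=")[^"]*' <<<"$html")

for ((i=0; i<${#links[@]}; i++)); do
  case ${links[$i]} in
    http://*|https://*) ;;                      # already absolute
    /*) links[$i]="$URL${links[$i]}" ;;         # root-relative
    *)  links[$i]="$URL/${links[$i]}" ;;        # page-relative
  esac
done

printf '%s\n' "${links[@]}"
```

This prints the four expected URLs: http://one.com, https://two.com, https://google.com/three, https://google.com/four. Note that `grep -P` requires GNU grep built with PCRE support.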
