Shell – Fix array of incomplete links

shell-script

I'm writing a bash script that finds links in a page: it downloads the page's source code and extracts the href values of the <a> tags into an array, $links. The problem is that some links are not complete URLs, just a sub-page or file. Say the source code of a page looks like this:

<h1>Julian's crappy webpage</h1>
<a href="http://one.com">Page One</a>
<a href="https://two.com">Page Two</a>
<a href="/three">Page Three</a>
<a href="four">Page Four</a>

After finding links, the array would look something like this:

http://one.com
https://two.com
/three
four

I also have a variable $URL which is a valid URL, e.g. "https://google.com"

I need a way to make all of the items in $links valid by doing different things based on the item. For every item in $links:

  • If the item starts with a /, then combine: $URL + ${links[n]}
  • If the item does not start with a "/" or "https://" or "http://", then combine: $URL + "/" + ${links[n]}

Expected result for the example above:

http://one.com
https://two.com
https://google.com/three
https://google.com/four
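
The two rules above can be sketched as a small bash function (a sketch only; `normalize_link` is an illustrative name, not something from my script):

```shell
#!/bin/bash
URL="https://google.com"

# Print the absolute form of one link according to the two rules.
normalize_link() {
  case $1 in
    http://*|https://*) printf '%s\n' "$1" ;;      # already absolute
    /*)                 printf '%s\n' "$URL$1" ;;  # root-relative: prepend $URL
    *)                  printf '%s\n' "$URL/$1" ;; # page-relative: prepend $URL/
  esac
}
```

For example, `normalize_link /three` should print `https://google.com/three`.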

Best Answer

Here is a sample script which you can use to implement this:

#!/bin/bash

readonly URL="https://google.com"
# Read the href values into an array. [^"]* stops at the closing quote even
# when several links share one line (a greedy .* would swallow them all), and
# mapfile avoids word-splitting on links that contain whitespace.
mapfile -t links < <(grep -oP '(?<=href=")[^"]*' source.html)

for ((i=0; i<${#links[@]}; i++))
do
  case ${links[$i]} in
    http://*|https://*)
      : # already absolute, do nothing
      ;;
    /*)
      links[$i]="$URL${links[$i]}" # root-relative: prepend $URL
      ;;
    *)
      links[$i]="$URL/${links[$i]}" # page-relative: prepend $URL/
      ;;
  esac
done

It assumes source.html is the downloaded source of the web page in the current directory.
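
To see it end to end without downloading anything, here is a self-contained run using the question's sample markup in place of source.html (the loop body is the same as above):

```shell
#!/bin/bash
URL="https://google.com"

# Stand-in for source.html, taken from the question's example page.
html='<h1>Julian'"'"'s crappy webpage</h1>
<a href="http://one.com">Page One</a>
<a href="https://two.com">Page Two</a>
<a href="/three">Page Three</a>
<a href="four">Page Four</a>'

mapfile -t links < <(grep -oP '(?<=href=")[^"]*' <<<"$html")

for ((i=0; i<${#links[@]}; i++)); do
  case ${links[$i]} in
    http://*|https://*) ;;                      # already absolute
    /*) links[$i]="$URL${links[$i]}" ;;         # root-relative
    *)  links[$i]="$URL/${links[$i]}" ;;        # page-relative
  esac
done

printf '%s\n' "${links[@]}"
```

This prints the four expected URLs: http://one.com, https://two.com, https://google.com/three, https://google.com/four. Note that `grep -P` requires GNU grep built with PCRE support.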
