Shell command or series of commands similar to the Automator action “Get Link URLs from Webpages”

automator  bash  terminal

I have a long list of URLs. Each of these webpages contains links I need to extract. The Automator action Get Link URLs from Webpages is a very useful action for this task. Unfortunately, Automator itself does not handle heavy workloads very well and very often either crashes or hangs indefinitely. How might I go about this using Bash via the Mac OS X Terminal app?

Edit – this is the current script as it stands right now.

#!/bin/bash

echo "Enter up to 3 words"
read -p "" v1 v2 v3 

web="$HOME/web.txt"
tmp="$HOME/tmp.txt"
err="$HOME/err.txt"
fin="$HOME/fin.txt"
arc="$HOME/arc.txt"

n="$(awk 'END {print NR}' "$web")"
echo "Processing $n URLs..."

grep 'http' "$web" | \
while read -r url; do
    lynx -nonumbers -hiddenlinks=merge -dump -listonly "$url" 2>>"$err" | awk '!a[$0]++' >> "$tmp"
    sleep 1
    n=$((n-1))
    [[ $n -gt 0 ]] &&  echo "$n URLs left to process..." || echo "Processing Completed!"
done

grep -e "$v1" -e "$v2" -e "$v3" "$tmp" | sort -u | cat > "$fin" 
cat "$fin" >> "$arc"

for r in "Results This Session"; do echo "$(cat "$fin" | wc -l)" "$r"; done
for a in "URL's Archived"; do echo "$(cat "$arc" | wc -l)" "$a"; done

I added read -p to the beginning of the script. Are there limitations on the number of variables that can be used like this? I've successfully used up to 9 in testing. And is there a more practical way to write this? I tried read -p "" {v1..v9}, which didn't work. I also added a couple of for loops at the end to indicate how much data was processed.
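
Would reading the words into an array be a more practical way to write this? Something along these lines, perhaps (untested sketch; terms and args are just placeholder names, the rest reuses the variables from the script above):

read -r -p "Enter search words: " -a terms

# Build one -e argument per word so grep matches any of them.
args=()
for t in "${terms[@]}"; do
    args+=(-e "$t")
done

# Only filter if at least one word was entered.
if [[ ${#args[@]} -gt 0 ]]; then
    grep "${args[@]}" "$tmp" | sort -u > "$fin"
fi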

Current issues

  • Sometimes I get an error:

    sort: string comparison failed: Illegal byte sequence
    sort: Set LC_ALL='C' to work around the problem.
    

    However, when I add LC_ALL=C to the script it doesn't seem to correct this.
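
    For example, would prefixing just the sort call be the right way to apply it, something like this?

    grep -e "$v1" -e "$v2" -e "$v3" "$tmp" | LC_ALL=C sort -u > "$fin"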

Best Answer

Here's a script to get you started:

#!/bin/bash

urls="/path/to/Input_URLs_List.txt"
output="/path/to/Output_Link_URLs.txt"

n="$(awk 'END {print NR}' "$urls")"
echo "Processing $n URLs..."

cat "$urls" | \
while read url; do
    lynx -dump -listonly "$url" >> "$output"
    sleep 5
    n=$((n-1))
    echo "$n URLs left to process..."
done

This will dump all of the links into a file that you can further process based on what you're looking for. Additional code could be added to filter and process the output; however, without knowing exactly what you need, you'll have to work on it yourself or ask some additional questions.
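
For example, if you only wanted the Link URLs containing certain words, a simple grep pass over the output file might be enough (the words and the second file path here are just placeholders):

grep -e "word1" -e "word2" "/path/to/Output_Link_URLs.txt" > "/path/to/Filtered_Link_URLs.txt"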


To clean up the output, use the following as an example:

Using "https://www.google.com" as one of the URLs the output would look like:

$ lynx -dump -listonly "https://www.google.com"


References

    1. https://www.google.com/imghp?hl=en&tab=wi
    2. https://maps.google.com/maps?hl=en&tab=wl
    3. https://play.google.com/?hl=en&tab=w8
    4. https://www.youtube.com/?tab=w1

I've truncated the output; there are actually 19 Link URLs.

To have the output just be a list of URLs, no numbers or whitespace, etc., use awk either in conjunction with lynx or afterwards.

$ lynx -dump -listonly "https://www.google.com" | awk '/:/{print $2}'
https://www.google.com/imghp?hl=en&tab=wi
https://maps.google.com/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?tab=w1

So if you want the output file to be just the Link URLs, change the lynx command line to:

lynx -dump -listonly "$url" | awk '/:/{print $2}' >> "$output"

You can always process the contents of the output file, later in the script or afterwards, to get it down to just the Link URLs you really want, and you can use a different search parameter in awk. Here I used ":" both to eliminate the blank lines in the lynx output and to show an example of how it can be filtered: only lines containing a colon get output by awk, so only Link URLs get redirected into the output file, as all URLs should contain a colon. The {print $2}, put simply, removes everything to the left of the actual Link URL.
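
As one variation, not something your task necessarily needs, you could match on the second field itself rather than on the whole line, so that only entries which actually look like http or https URLs survive:

lynx -dump -listonly "$url" | awk '$2 ~ /^https?:\/\//{print $2}' >> "$output"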


Here's an updated script which sorts and removes duplicate Link URLs:

#!/bin/bash

urls="/path/to/Input_URLs_List.txt"
output="/path/to/Output_Link_URLs.txt"

n="$(awk 'END {print NR}' "$urls")"
echo "Processing $n URLs..."

cat "$urls" | \
while read url; do
    lynx -dump -listonly "$url" | awk '/:/{print $2}' | sort | uniq >> "$output"
    sleep 5
    n=$((n-1))
    [[ $n -gt 0 ]] &&  echo "$n URLs left to process..." || echo "Processing Completed!"
done
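
Note that sort | uniq inside the loop only removes duplicate Link URLs within a single page's output; Link URLs that repeat across different pages will still appear more than once in the output file. If that matters, one option is a single global pass after the loop, e.g.:

# Deduplicate across all processed pages; with -o, sort reads the whole file before overwriting it.
sort -u "$output" -o "$output"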

Update to capture stderr output from lynx to a file:

To capture stderr output from lynx to a file, redirect stderr to a disk file by adding, e.g., 2>>"$file" after "$url":

lynx -dump -listonly "$url" 2>>"$file" >> "$output"

Add errlog="/path/to/Lynx_Errors.txt" under output="/path/to/Output_Link_URLs.txt" and then change the lynx command line to, e.g.:

lynx -dump -listonly "$url" 2>>"$errlog" >> "$output"

Or:

lynx -dump -listonly "$url" 2>>"$errlog" | awk '/:/{print $2}' | sort | uniq >> "$output"

Example:

#!/bin/bash

urls="/path/to/Input_URLs_List.txt"
output="/path/to/Output_Link_URLs.txt"
errlog="/path/to/Lynx_Errors.txt"

n="$(awk 'END {print NR}' "$urls")"
echo "Processing $n URLs..."

cat "$urls" | \
while read url; do
    lynx -dump -listonly "$url" 2>>"$errlog" | awk '/:/{print $2}' | sort | uniq >> "$output"
    sleep 5
    n=$((n-1))
    [[ $n -gt 0 ]] &&  echo "$n URLs left to process..." || echo "Processing Completed!"
done
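
If you then want to narrow the combined output down to the keywords from your read -p prompt, you could filter the finished file in a final step; prefixing sort with LC_ALL=C also works around the "Illegal byte sequence" error you mentioned. A sketch, with placeholder keywords and a placeholder final path:

grep -e "word1" -e "word2" -e "word3" "$output" | LC_ALL=C sort -u > "/path/to/Final_Link_URLs.txt"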