Grep and Cut – How to Use Grep and Cut in Script to Obtain Website URLs from HTML

I am trying to use grep and cut to extract URLs from an HTML file. The links look like:

<a href="http://examplewebsite.com/">

Other websites have .net, .gov, but I assume I could make the cut off point right before >. So I know I can use grep and cut somehow to cut off everything before http and after .com, but I have been stuck on it for a while.

Best Answer

As I said in my comment, it's generally not a good idea to parse HTML with Regular Expressions, but you can sometimes get away with it if the HTML you're parsing is well-behaved.

To get only the URLs that are in the href attribute of <a> elements, I find it easiest to work in multiple stages. From your comments, it looks like you only want the scheme and host name (for example http://examplewebsite.com), not the full URL. In that case you can use something like this:

grep -Eoi '<a [^>]+>' source.html |
grep -Eo 'href="[^"]+"' |
grep -Eo '(http|https)://[^/"]+'

where source.html is the file containing the HTML code to parse.

This prints the scheme and host name of every URL that occurs as the href attribute of an <a> element in the file, one per line. The -i option on the first grep command ensures that it works on both <a> and <A> elements. You could also give -i to the second grep to capture upper-case HREF attributes, but I'd prefer to ignore such broken HTML. :)
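Since the question asks about cut specifically, here is a minimal sketch of the same staged approach with cut doing the final trimming. The function names extract_links and hosts_only and the sample input are illustrative only, and the sketch assumes double-quoted http or https href values.

```shell
#!/bin/sh
# Hypothetical helper: print the full URL of every double-quoted
# http(s) href found in <a> tags on standard input.
extract_links() {
    grep -Eoi '<a [^>]+>' |             # isolate each opening <a ...> tag
    grep -Eo 'href="https?://[^"]+"' |  # keep only the href attribute
    cut -d '"' -f 2                     # field 2 is the text between the quotes
}

# Hypothetical helper: reduce each URL to scheme://host.
# Splitting on "/" yields "http:", "", "host", ... so fields 1-3 rejoin as scheme://host.
hosts_only() {
    cut -d '/' -f 1-3
}

# Made-up sample input; the upper-case HREF is skipped on purpose.
printf '%s\n' '<a href="http://examplewebsite.com/">' \
              '<A HREF="x"><a class="nav" href="https://example.gov/a/b.html">' |
    extract_links | hosts_only
```

For a real file, something like extract_links < source.html should print the full URLs, and piping that through hosts_only trims each one to its scheme and host.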

To process the contents of http://google.com/:

wget -qO- http://google.com/ |
grep -Eoi '<a [^>]+>' |
grep -Eo 'href="[^"]+"' |
grep -Eo '(http|https)://[^/"]+'

Output:

http://www.google.com.au
http://maps.google.com.au
https://play.google.com
http://www.youtube.com
http://news.google.com.au
https://mail.google.com
https://drive.google.com
http://www.google.com.au
http://www.google.com.au
https://accounts.google.com
http://www.google.com.au
https://www.google.com
https://plus.google.com
http://www.google.com.au

My output is a little different from the other examples as I get redirected to the Australian Google page.
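The list above contains repeated hosts. If duplicates are not wanted, a minimal sketch is to pipe the result through sort and uniq. The helper names unique_hosts and count_hosts are hypothetical, and the three input lines are copied from the output above.

```shell
#!/bin/sh
# Hypothetical helper: drop repeated lines from standard input.
unique_hosts() {
    sort -u
}

# Hypothetical helper: count how often each line occurs, most frequent first.
count_hosts() {
    sort | uniq -c | sort -rn
}

printf '%s\n' http://www.google.com.au https://play.google.com http://www.google.com.au |
    unique_hosts

printf '%s\n' http://www.google.com.au https://play.google.com http://www.google.com.au |
    count_hosts
```

Either helper can be appended to the end of the wget pipeline.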

Related Solutions

Cut / grep and df -h

The most convenient tool for this kind of task is awk:

df -h /dev/sda2 | awk 'NR==2{print$4}'

Or, if more partitions are listed, you can pick the right line by the device name in the first column:

df -h | awk '$1=="/dev/sda2"{print$4}'
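To see the field selection without depending on the local disk layout, here is a small sketch that runs the same awk test against canned df-style text. The helper name avail_for and the sample figures are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read df-style output on standard input and print the
# "Avail" column (field 4) of the line whose first field equals $1.
avail_for() {
    awk -v dev="$1" '$1 == dev { print $4 }'
}

# Canned sample so the sketch runs anywhere; the numbers are invented.
sample='Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   20G   28G  42% /
/dev/sda2       200G  150G   40G  79% /home'

printf '%s\n' "$sample" | avail_for /dev/sda2   # prints 40G
```

On a real system this would be used as df -h | avail_for /dev/sda2. Adding -P to df may be worth considering, since some df versions wrap long device names onto a second line in the default format, which shifts the field numbers.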

It is also simple with sed, but less pleasant if you need to debug it a few months later:

df -h /dev/sda2 | sed -rn '2s/^((\S+)\s+){4}.*/\2/p'

df -h | sed -rn '/^\/dev\/sda2/s/^((\S+)\s+){4}.*/\2/p'

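That relies on GNU sed's -r option (extended regular expressions). Without it, the same expressions need a lot more escaping. Note that \S, \s and \+ are still GNU extensions, so these are not strictly POSIX either: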

df -h /dev/sda2 | sed -n '2s/^\(\(\S\+\)\s\+\)\{4\}.*/\2/p'

df -h | sed -n '/^\/dev\/sda2/s/^\(\(\S\+\)\s\+\)\{4\}.*/\2/p'
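The \S, \s and \+ tokens used above are GNU extensions. As a rough sketch of the same extraction using only POSIX bracket classes and interval expressions, the first three fields can be matched as a repeated group and the fourth captured separately. The helper names and the sample text are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print field 4 of line 2 using only POSIX BRE syntax.
# \1 is the repeated "field + blanks" group, \2 is the fourth field.
avail_line2() {
    sed -n '2s/^\([^[:space:]]\{1,\}[[:space:]]\{1,\}\)\{3\}\([^[:space:]]\{1,\}\).*/\2/p'
}

# Hypothetical helper: same substitution, applied to the /dev/sda2 line.
avail_sda2() {
    sed -n '/^\/dev\/sda2[[:space:]]/s/^\([^[:space:]]\{1,\}[[:space:]]\{1,\}\)\{3\}\([^[:space:]]\{1,\}\).*/\2/p'
}

# Canned df-style sample; the numbers are invented.
sample='Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   20G   28G  42% /
/dev/sda2       200G  150G   40G  79% /home'

printf '%s\n' "$sample" | avail_line2   # prints 28G
printf '%s\n' "$sample" | avail_sda2    # prints 40G
```

This is even harder to read than the GNU forms, which is a fair argument for the awk version.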

Bash – extract filenames from html file containing multiple links

The best strategy would be to use a proper HTML parser that can print the contents of all <a> elements.

Here, xmlstarlet is specifically an XML parser, and your HTML may not be well-formed XML, but you might get the idea:

echo '<html>
<a href="000000.jpg" title="image name.jpg" target="_blank">Image name.jpg</a>
</html>' | xmlstarlet sel -t -v //a

Image name.jpg
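The command above prints the link text. If the file name in the href attribute is what is wanted, a minimal sketch is to select that attribute instead. It assumes xmlstarlet is installed and the input is well-formed XML; the helper name link_targets is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the href attribute of every <a> element,
# one per line. -m iterates over matches, -v prints a value, -n adds a newline.
link_targets() {
    xmlstarlet sel -t -m '//a' -v '@href' -n
}

echo '<html>
<a href="000000.jpg" title="image name.jpg" target="_blank">Image name.jpg</a>
</html>' | link_targets
```

With the sample input this should print 000000.jpg.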
