Grep and Cut – How to Use Grep and Cut in Script to Obtain Website URLs from HTML

cut, grep, shell-script, string

I am trying to use grep and cut to extract URLs from an HTML file. The links look like:

<a href="http://examplewebsite.com/">

Other websites end in .net, .gov, and so on, so I assume the cut-off point should be just before the closing >. I know I can somehow use grep and cut to strip everything before http and everything after .com, but I have been stuck on this for a while.

Best Answer

As I said in my comment, it's generally not a good idea to parse HTML with Regular Expressions, but you can sometimes get away with it if the HTML you're parsing is well-behaved.

To get only the URLs that are in the href attribute of <a> elements, I find it easiest to work in multiple stages. From your comments, it looks like you only want the scheme and host name (for example http://examplewebsite.com), not the full URL. In that case you can use something like this:

grep -Eoi '<a [^>]+>' source.html |
grep -Eo 'href="[^"]+"' |
grep -Eo 'https?://[^/"]+'

where source.html is the file containing the HTML code to parse.
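
Since the question asks about cut specifically, here is a minimal sketch of the same idea with cut doing the final trimming. The file sample.html, its contents, and the function name extract_hrefs are made up for illustration; it assumes every href value is wrapped in double quotes.

```shell
# Build a small sample file (hypothetical contents, for illustration only).
cat > sample.html <<'EOF'
<p>Intro</p>
<a href="http://examplewebsite.com/">Example</a>
<A class="x" href="https://example.net/path/page.html">Other</A>
EOF

# Hypothetical helper: print the full href value of every <a> tag in a file.
extract_hrefs() {
    grep -Eoi '<a [^>]+>' "$1" |   # keep only the opening <a ...> tags
    grep -Eo 'href="[^"]+"' |      # keep only the href="..." attribute
    cut -d'"' -f2                  # field 2 is the text between the quotes
}

# Full URLs.
extract_hrefs sample.html

# Host names only: splitting on "/" puts the host in field 3.
extract_hrefs sample.html | cut -d/ -f3
```

Using the double quote as the cut delimiter avoids having to know whether the address ends in .com, .net or .gov.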

This code prints the scheme and host name of every URL that occurs as the href attribute of an <a> element, one match per line. Because grep works line by line, an <a> tag that is split across several lines will be missed. The -i option on the first grep command ensures that it works on both <a> and <A> elements. You could also give -i to the second grep to capture upper-case HREF attributes; on the other hand, I'd prefer to ignore such broken HTML. :)
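
If upper-case attributes do need to be handled, a sketch of the case-insensitive variant could look like this. The function name hosts_ci and the sample input line are hypothetical.

```shell
# Hypothetical helper: read HTML on stdin, print scheme://host for each link.
# -i on every stage accepts <A>, HREF= and HTTP:// as well as lower case.
hosts_ci() {
    grep -Eoi '<a [^>]+>' |
    grep -Eoi 'href="[^"]+"' |
    grep -Eoi 'https?://[^/"]+'
}

# Made-up input with an upper-case tag and attribute.
printf '%s\n' '<A HREF="http://example.org/page">x</A>' | hosts_ci
```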

To process the contents of http://google.com/:

wget -qO- http://google.com/ |
grep -Eoi '<a [^>]+>' |
grep -Eo 'href="[^"]+"' |
grep -Eo 'https?://[^/"]+'

Output:

http://www.google.com.au
http://maps.google.com.au
https://play.google.com
http://www.youtube.com
http://news.google.com.au
https://mail.google.com
https://drive.google.com
http://www.google.com.au
http://www.google.com.au
https://accounts.google.com
http://www.google.com.au
https://www.google.com
https://plus.google.com
http://www.google.com.au

My output is a little different from the other examples as I get redirected to the Australian Google page.
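
The listing above repeats several hosts. If each one should appear only once, piping the result through sort -u is one option. The sketch below uses a few lines copied from that output as stand-in input; the function name unique_hosts is hypothetical.

```shell
# Hypothetical helper: drop repeated lines from stdin.
# LC_ALL=C makes the sort order byte-wise and independent of the user's locale.
unique_hosts() {
    LC_ALL=C sort -u
}

# Stand-in input: a few lines taken from the output listing.
printf '%s\n' \
    'http://www.google.com.au' \
    'https://play.google.com' \
    'http://www.google.com.au' |
unique_hosts
```

In practice the function would simply be appended as a last stage of the wget pipeline.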
