I am trying to use grep and cut to extract URLs from an HTML file. The links look like:
<a href="http://examplewebsite.com/">
Other websites have .net, .gov, but I assume I could make the cut-off point right before >. So I know I can use grep and cut somehow to cut off everything before http and after .com, but I have been stuck on it for a while.
Best Answer
As I said in my comment, it's generally not a good idea to parse HTML with regular expressions, but you can sometimes get away with it if the HTML you're parsing is well-behaved.
In order to get only the URLs that are in the href attribute of <a> elements, I find it easiest to do it in multiple stages. From your comments, it looks like you only want the base URL of each site (scheme and host name), not the full URL. In that case you can use something like this:
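A minimal sketch of such a multi-stage pipeline, assuming a grep that supports -o (GNU and BSD grep both do). The sample source.html written at the top is made up so the example runs on its own, and the three patterns are one plausible choice rather than the only way to do it:

```shell
# Hypothetical sample input, only here so the sketch is self-contained.
cat > source.html <<'EOF'
<a href="http://examplewebsite.com/">Example</a>
<A href="https://www.example.net/page.html">Net</A>
<p>Not a link: http://ignored.example.org/</p>
EOF

# Stage 1: isolate the opening <a ...> tags (-i also matches <A ...>).
# Stage 2: keep only the href="..." attribute of each tag.
# Stage 3: keep the scheme and host name, dropping any path.
grep -Eoi '<a [^>]+>' source.html |
  grep -Eo 'href="[^"]+"' |
  grep -Eo 'https?://[^/"]+'
# prints:
#   http://examplewebsite.com
#   https://www.example.net
```

The bare URL inside the <p> element is skipped because stage 1 only passes <a> tags on to the later stages.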
source.html is the file containing the HTML code to parse.
This code will print all top-level URLs that occur as the href attribute of any <a> elements in each line. The -i option to the first grep command is to ensure that it will work on both <a> and <A> elements. I guess you could also give -i to the 2nd grep to capture upper case HREF attributes; OTOH, I'd prefer to ignore such broken HTML. :)
To process the contents of http://google.com/
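One way to do that without saving the page first is to pipe the download straight into the same three grep stages. This is only a sketch: the function names extract_hosts and hosts_from_url are hypothetical, and using curl as the downloader is an assumption (wget -qO- would serve the same purpose).

```shell
# The three stages as a filter on standard input (hypothetical helper name).
extract_hosts() {
  grep -Eoi '<a [^>]+>' |
    grep -Eo 'href="[^"]+"' |
    grep -Eo 'https?://[^/"]+'
}

# Fetch a page and feed it directly into the filter.
# -s silences the progress meter, -L follows redirects.
hosts_from_url() {
  curl -sL "$1" | extract_hosts
}
```

Calling hosts_from_url http://google.com/ then prints one base URL per link; appending | sort -u collapses duplicates. What comes back depends on the page the server returns for your location.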
output
My output is a little different from the other examples as I get redirected to the Australian Google page.