How to use awk to extract URLs from an HTML file

awk html

I have an HTML file with JavaScript and CSS in the source. The JavaScript contains a series of URLs embedded among other metadata. I want to use awk to extract the URLs (all enclosed in double quotes, with the http:// prefix) and print them to stdout. I don't know how to use awk, but it seems to be the right tool for the job. A sample entry from the file:

{
title: "Dsssat",
artist: "cxpl djij awsoj e",
mp3: "http://somesite.com/seal/dsssat.mp3",
},

Best Answer

You can use grep with the -o option, which prints only the matched part of each line. To include the double quotes:

grep -o '"http://[^"]*"' myfile.html

To exclude the double quotes:

grep -o 'http://[^"]*' myfile.html

Edit

You may want to do some further filtering to ensure that you only match the URLs in the JavaScript objects:

grep -o 'mp3: "http://[^"]*"' myfile.html | grep -o '"http://[^"]*"'

grep -o 'mp3: "http://[^"]*"' myfile.html | grep -o 'http://[^"]*'
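Since the question asked for awk specifically, here is a minimal sketch of the same extraction in awk. It assumes, as in the question, that every URL starts with http:// and ends at the next double quote. The function name extract_urls is hypothetical and only there for illustration; match(), substr(), RSTART and RLENGTH are standard awk.

```shell
# Hypothetical helper: print every http:// URL found in the given file, one per line.
extract_urls() {
  awk '{
    line = $0
    # match() finds the leftmost match in line and sets RSTART and RLENGTH
    while (match(line, /http:\/\/[^"]*/)) {
      print substr(line, RSTART, RLENGTH)
      # keep searching in the rest of the line, so several URLs per line are found
      line = substr(line, RSTART + RLENGTH)
    }
  }' "$1"
}
```

Calling extract_urls myfile.html prints the URLs without the surrounding quotes, like the second grep command above. The loop is needed because match() only reports the first match on a line.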

Related Solutions

Extract Values from simple html file via grep/awk

> awk '/ID="idButtonTd"/ {printline=1; next;}; 
   printline==1 && /^[0-9]+\.[0-9]+$/ { print $0; }; { printline=0; }' file
18.000
0.00000

Bash – extract filenames from html file containing multiple links

The best strategy would be to use a proper HTML parser that can print the value of every <a> tag.

xmlstarlet is strictly an XML parser, and your HTML may not be well-formed XML, but this example shows the idea:

echo '<html>
<a href="000000.jpg" title="image name.jpg" target="_blank">Image name.jpg</a>
</html>' | xmlstarlet sel -t -v //a

Image name.jpg
