How to use awk to extract URLs from an HTML file

awk html

I have an HTML file with JavaScript and CSS in the source. The JavaScript contains a series of URLs embedded among other metadata. I want to use awk to extract the URLs (each is enclosed in double quotes and starts with http://) and print them to stdout. I don't know how to use awk, but it seems to be the right tool for the job.

{
title: "Dsssat",
artist: "cxpl djij awsoj e",
mp3: "http://somesite.com/seal/dsssat.mp3",
},

Best Answer

You can use grep instead of awk. The -o option prints only the matching part of each line. To include the double quotes:

grep -o '"http://[^"]*"' myfile.html

To exclude the double quotes:

grep -o 'http://[^"]*' myfile.html
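Since the question asks specifically about awk, here is a minimal sketch of the same extraction with awk's match() function. The function name extract_urls is made up for illustration, and the sketch assumes, as in the sample above, that every URL starts with http:// and ends at the next double quote.

```shell
# Hypothetical helper: print every http:// URL in the given file,
# one per line, without the surrounding double quotes.
extract_urls() {
  awk '{
    line = $0
    # match() sets RSTART and RLENGTH to the position and length of the match.
    while (match(line, /http:\/\/[^"]*/)) {
      print substr(line, RSTART, RLENGTH)
      # Continue searching in the rest of the line after this match.
      line = substr(line, RSTART + RLENGTH)
    }
  }' "$1"
}
```

Call it as extract_urls myfile.html. The loop matters only when a single line holds more than one URL; grep -o handles that case on its own, which is why the grep version is shorter.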

Edit

You may want to do some further filtering to ensure that you only match the URLs in the JavaScript objects:

grep -o 'mp3: "http://[^"]*"' myfile.html | grep -o '"http://[^"]*"'

grep -o 'mp3: "http://[^"]*"' myfile.html | grep -o 'http://[^"]*'
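The two-stage grep pipeline can also be written as a single awk command. This is only a sketch: the function name extract_mp3_urls is hypothetical, and it assumes each mp3 entry sits on its own line with the URL as the first double-quoted string on that line, as in the sample from the question.

```shell
# Hypothetical helper: print the URL from each line of the form
#   mp3: "http://...",
# Using the double quote as the field separator makes the URL field 2.
extract_mp3_urls() {
  awk -F'"' '/mp3: "http:\/\// { print $2 }' "$1"
}
```

Call it as extract_mp3_urls myfile.html. If a line contains other quoted strings before the mp3 entry, field 2 will no longer be the URL, and the grep pipeline above is the safer choice.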