Use sed to delete all but a certain pattern

regular expression, sed

How do I extract just the URL from the HTML source of a link?

I have

<a href="http://unix.stackexchange.com/users/20661/">Unix &amp; Linux

and would like to get just

http://unix.stackexchange.com/users/20661/

I tried

sed 's/^.*(http.*)".*$/\1/g'

but that gives an error:

sed: -e expression #1, char 22: invalid reference \1 on `s' command's RHS

Best Answer

Try this:

sed -r 's/.*(http[^"]*)".*/\1/g'

On macOS (BSD sed), use -E instead of -r:

sed -E 's/.*(http[^"]*)".*/\1/g'
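As a quick check, the command can be run against the sample line from the question. This is a minimal sketch: the variable names line and url are only illustrative, and -E is used because both GNU sed and BSD sed accept it.

```shell
# Sample input line taken from the question.
line='<a href="http://unix.stackexchange.com/users/20661/">Unix &amp; Linux'

# -E turns on extended regex, so ( ) group without backslashes.
# [^"]* stops the capture at the first double quote after "http".
url=$(printf '%s\n' "$line" | sed -E 's/.*(http[^"]*)".*/\1/g')

printf '%s\n' "$url"
```

This should print only the URL, with the surrounding tag and link text removed.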

Notes

There are several things to note about the original command from the question:

sed 's/^.*(http.*)".*$/\1/g'
  1. The ^ is unnecessary. sed looks for the leftmost match and its regular expressions are always greedy, so if a regex that begins with .* matches at all, it will match from the beginning of the line.

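  2. In basic regex, which is sed's default, an unescaped ( is a literal character, so the expression contains no group and \1 is an invalid reference. To make ( a grouping character, either escape it as \( and \) or turn on extended regex with the -r flag (-E on macOS). Extended regex often greatly reduces the number of escapes that you need.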

  3. Because regexes are greedy, (http.*)" matches up to the last double quote on the line, not the first. The URL, however, ends at the first double quote after it. Use (http[^"]*)" instead, and the match will never extend beyond that first ".

  4. The dollar sign in .*$ is also superfluous. Again, because regexes are greedy, a regular expression that ends with .* will, if it matches, match to the end of the line.
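Notes 2 and 3 can be seen side by side in the sketch below. The input line is a made-up example (not from the question) with a second quoted attribute after the URL, and the variable names greedy, bounded, and basic are only illustrative.

```shell
# Hypothetical input: a second quoted attribute follows the URL.
line='<a href="http://example.com/" title="x">link'

# Greedy (http.*)" runs on to the LAST double quote on the line.
greedy=$(printf '%s\n' "$line" | sed -E 's/.*(http.*)".*/\1/')

# (http[^"]*)" stops at the FIRST double quote after the URL.
bounded=$(printf '%s\n' "$line" | sed -E 's/.*(http[^"]*)".*/\1/')

# Same pattern in basic regex: the parentheses are escaped, so no -E or -r is needed.
basic=$(printf '%s\n' "$line" | sed 's/.*\(http[^"]*\)".*/\1/')

printf '%s\n' "$greedy" "$bounded" "$basic"
```

The greedy form should capture the URL plus the following title attribute up to the last quote, while the other two forms should yield only http://example.com/.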
