Grep to ignore patterns

grep

I am extracting URLs from a website using cURL as below.

curl www.somesite.com | grep "<a href=.*title=" > new.txt

My new.txt file is as below.

<a href="http://website1.com" title="something">
<a href="http://website1.com" information="something" title="something">
<a href="http://website2.com" title="some_other_thing">
<a href="http://website2.com" information="something" title="something">
<a href="http://websitenotneeded.com" title="something NOTNEEDED">

However, I need to extract only the below information.

<a href="http://website1.com" title="something">
<a href="http://website2.com" information="something" title="something">

I am trying to ignore the <a href lines that contain an information attribute and those whose title ends with NOTNEEDED.

How can I modify my grep statement?

Best Answer

I'm not fully following your example + the description but it sounds like what you want is this:

$ grep -v "<a href=.*title=.*NOTNEEDED" sample.txt 
<a href="http://website1.com" title="something">
<a href="http://website1.com" information="something" title="something">
<a href="http://website2.com" title="some_other_thing">
<a href="http://website2.com" information="something" title="something">

So for your example:

$ curl www.example.com | grep "<a href=.*title=" | grep -v NOTNEEDED > new.txt
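
If the goal is to drop both the anchors that carry an information attribute and the ones whose title ends with NOTNEEDED, a second exclusion pattern can be added with -e. The following is only a sketch under assumptions: the temporary file stands in for the curl output, and the two exclusion patterns are guesses based on the listing in the question rather than something tested against the real site.

```shell
# Recreate the listing from the question in a temporary file (stand-in for the curl output).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<a href="http://website1.com" title="something">
<a href="http://website1.com" information="something" title="something">
<a href="http://website2.com" title="some_other_thing">
<a href="http://website2.com" information="something" title="something">
<a href="http://websitenotneeded.com" title="something NOTNEEDED">
EOF

# First grep keeps anchor lines that have a title.
# Second grep drops lines with an information attribute or a title value ending in NOTNEEDED.
result=$(grep '<a href=.*title=' "$sample" \
  | grep -v -e ' information=' -e 'title="[^"]*NOTNEEDED"')

printf '%s\n' "$result"
rm -f "$sample"
```

Single quotes around the patterns keep the shell from touching the double quotes and brackets inside them.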

Related Solutions

Reading grep patterns from a file

The -f option names a file from which grep reads patterns, one per line. That's just like passing patterns on the command line (with the -e option if there's more than one), except that patterns in a file need no shell quoting, whereas on the command line you may need to quote the pattern to protect special characters in it from being expanded by the shell.

The option -E, -F or -P, if any, tells grep which syntax the patterns are written in. With none of these options, grep expects basic regular expressions; with -E, grep expects extended regular expressions; with -P (if supported), grep expects Perl regular expressions; and with -F, grep expects literal strings. Whether the patterns come from the command line or from a file doesn't matter.

Note that the strings are substrings: if you pass a+b as a pattern then a line containing a+b+c is matched. If you want to search for lines containing exactly one of the supplied strings and no more, then pass the -x option.
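
To make the difference between these options concrete, here is a small sketch using a one-line pattern file. The file contents and variable names are made up for illustration.

```shell
# Hypothetical pattern file and input, created in temporary files.
patterns=$(mktemp)
input=$(mktemp)
printf '%s\n' 'a+b' > "$patterns"
printf '%s\n' 'a+b' 'a+b+c' 'aab' > "$input"

# -F: the pattern is a literal string, matched as a substring of each line.
fixed=$(grep -F -f "$patterns" "$input")

# -F -x: the literal string must match the whole line.
exact=$(grep -F -x -f "$patterns" "$input")

# -E: '+' becomes a quantifier, so the pattern means "one or more a, then b".
extended=$(grep -E -f "$patterns" "$input")

printf 'fixed:\n%s\nexact:\n%s\nextended:\n%s\n' "$fixed" "$exact" "$extended"
rm -f "$patterns" "$input"
```

With -F the lines a+b and a+b+c are selected, adding -x narrows that to a+b alone, and with -E only aab is selected.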

Grep – Print Unmatched Patterns Using Grep with Patterns from File

You could use grep -o to print only the matching part and use the result as patterns for a second grep -v on the original patterns.txt file:

grep -oFf patterns.txt Strings.xml | grep -vFf - patterns.txt

Though in this particular case you could also use join + sort:

join -t\" -v1 -j2 -o 1.1 1.2 1.3 <(sort -t\" -k2 patterns.txt) <(sort -t\" -k2 strings.xml)
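
As a rough illustration of the grep -o approach, the sketch below uses two small made-up files in place of patterns.txt and Strings.xml; the contents are assumptions, not the data from the original question.

```shell
# Hypothetical stand-ins for patterns.txt and Strings.xml.
patterns=$(mktemp)
strings=$(mktemp)
printf '%s\n' 'alpha' 'beta' 'gamma' > "$patterns"
printf '%s\n' '<string name="alpha">A</string>' '<string name="gamma">G</string>' > "$strings"

# Step 1: print only the parts of the input that matched some pattern.
# Step 2: use those matches as patterns (read from stdin via "-f -") and
#         print the lines of the pattern file that match none of them.
missing=$(grep -oFf "$patterns" "$strings" | grep -vFf - "$patterns")

printf '%s\n' "$missing"
rm -f "$patterns" "$strings"
```

Here only beta is printed, since it is the one pattern that never occurs in the input. Because the second grep also matches substrings, a pattern that contains another matched pattern would be dropped as well; adding -x to the second grep is one way to avoid that.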
