Bash – Extract Content Between Two Match Patterns

Tags: awk, bash, grep, scripts

I have a file that contains several kinds of text formats. My goal is to extract only the HTML part and create a file with that HTML code. I think this is possible with grep or awk. My file also contains lines like this:

Sender name `<test@email.com>`

I wrote this command: cat file1.html | grep -E "<[^>]*>". The problem is that it also outputs lines such as the Sender name line above. I want to extract only the content that starts at the <html> tag, so output like this is not useful to me:

References: <test@test.com>
From: test user <test@test.com>
Message-ID: <test@test.com>
In-Reply-To: <test@test.com>

Best Answer

We can achieve this with sed, the stream editor for filtering and transforming text. The short answer is given under point 5 below, but I've decided to write a detailed explanation.

0. First let's create a simple file to test our commands:

$ printf '\nTop text\nSender <example@email.com>\n\n<html>\n\tThe inner text 1\n</html>\n\nMiddle text\n\n<HTML>\n\tThe inner text 2\n</HTML>\n\nBottom text\n' | tee example.file

Top text
Sender <example@email.com>

<html>
        The inner text 1
</html>

Middle text

<HTML>
        The inner text 2
</HTML>

Bottom text

1. We can crop everything between the tags <html> and </html>, including them, in this way:

$ sed -n -e '/<html>/,/<\/html>/p' example.file

<html>
        The inner text 1
</html>

The option -e script (--expression=script) adds a script to the commands to be executed. In this case the script is '/<html>/,/<\/html>/p'. As long as we have only one script, we can omit this option.
The option -n (--quiet, --silent) suppresses automatic printing of the pattern space, so along with this option we need an additional command to tell sed what to print.
That additional command is the print command p, added to the end of the script. If sed were started without the -n option, the p command would print each line in the range twice.
Finally, the two comma-separated patterns, /<html>/,/<\/html>/, specify a range. Note that we use \ to escape the special character /, which acts as the delimiter here.
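
Since the question also mentions awk, the same range idea can be sketched there. This is only an illustration, assuming the sample content from step 0; the temporary file and the variable name block are made up for the example.

```shell
# Recreate the sample content from step 0 in a temporary file.
sample=$(mktemp)
printf '\nTop text\nSender <example@email.com>\n\n<html>\n\tThe inner text 1\n</html>\n\nMiddle text\n\n<HTML>\n\tThe inner text 2\n</HTML>\n\nBottom text\n' > "$sample"

# An awk range pattern with no action prints every line from the first
# pattern through the second one, much like sed -n '/a/,/b/p'.
block=$(awk '/<html>/,/<\/html>/' "$sample")
printf '%s\n' "$block"

rm -f "$sample"
```

As with the sed command, only the lower-case <html> block is printed here, because the patterns are case sensitive.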

2. If we want to crop everything between the tags <html> and </html>, without printing them, we should add some additional commands:

$ sed -n '/<html>/,/<\/html>/{ /html>/d; p }' example.file

        The inner text 1

The curly braces, { and }, are used to group the commands.
The command d deletes each line that matches the expression html> and skips the remaining commands for that line, so p is never reached for the tag lines.
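
One caveat: /html>/d also drops any inner line that happens to contain the text html>. A possible alternative, shown here only as a sketch, is an awk flag that is switched on after the opening tag and off at the closing tag. The sample content and the names inside and inner are assumptions for the illustration, and the tags are assumed to stand alone on their lines.

```shell
sample=$(mktemp)
printf 'Header <a@b.c>\n<html>\n<p>about html> tags</p>\n</html>\nFooter\n' > "$sample"

# The closing tag is tested first and the opening tag last, so neither
# delimiter line is printed; everything in between is printed unchanged.
inner=$(awk '/^<\/html>$/ {inside=0} inside {print} /^<html>$/ {inside=1}' "$sample")
printf '%s\n' "$inner"

rm -f "$sample"
```

Here the inner line survives even though it contains html>, whereas the sed command above would delete it.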

3. However, our example.file also has upper-case <HTML> tags, so we should make the pattern match case-insensitive. We can do that by adding the I flag after the closing / of each regular expression:

$ sed -n '/<html>/I,/<\/html>/I{ /html>/Id; p }' example.file

        The inner text 1
        The inner text 2

The I modifier to regular-expression matching is a GNU extension which causes the REGEXP to be matched in a case-insensitive manner.
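
Where GNU sed is not available, one possible workaround is to spell out both cases with bracket expressions. The following is a minimal sketch under that assumption; the shortened sample content and the variable name both are invented for the example.

```shell
sample=$(mktemp)
printf 'Top\n<html>\n\tThe inner text 1\n</html>\nMiddle\n<HTML>\n\tThe inner text 2\n</HTML>\nBottom\n' > "$sample"

# [Hh][Tt][Mm][Ll] matches html in any mix of upper and lower case,
# so the I flag (a GNU extension) is not needed.
both=$(sed -n '/<[Hh][Tt][Mm][Ll]>/,/<\/[Hh][Tt][Mm][Ll]>/p' "$sample")
printf '%s\n' "$both"

rm -f "$sample"
```

This prints both blocks including their tags; it is more verbose than the I flag but relies only on basic regular expressions.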

4. If we want to remove all HTML tags between the <html> tags, we can add a command that 'deletes' every string that begins with < and ends with >:

sed -n '/<html>/I,/<\/html>/I{ /html>/Id; s/<[^>]*>//g; p }' example.file

The command s substitutes every string that matches the expression <[^>]*> with an empty string; the general form is s/old/new/.
The pattern flag g applies the replacement to all matches of the regexp on a line, not just the first.

Probably we would want to omit the delete command in this case:

sed -n '/<html>/I,/<\/html>/I{ s/<[^>]*>//g; p }' example.file
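
To see the effect on a line with inline tags, here is a small sketch. The sample content and the variable name text are assumptions, and the extra /^[[:space:]]*$/d command is an addition of this example that drops lines left empty once their tags are stripped.

```shell
sample=$(mktemp)
printf 'From: user <user@example.com>\n<html>\n<p>Hello <b>world</b></p>\n</html>\n' > "$sample"

# Inside the range: strip every <...> tag, drop lines that are now blank,
# and print whatever text is left.
text=$(sed -n '/<html>/,/<\/html>/{ s/<[^>]*>//g; /^[[:space:]]*$/d; p; }' "$sample")
printf '%s\n' "$text"

rm -f "$sample"
```

The From: header is never touched because it lies outside the range, so its <user@example.com> part cannot leak into the output.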

5. To change the file in place and keep a backup copy, we can use the option -i (here -i.bak, which saves the original as example.file.bak). Alternatively, we can create a new file from sed's output by redirecting it with >:

sed -n '/<html>/I,/<\/html>/I p' example.file -i.bak

sed -n '/<html>/I,/<\/html>/I p' example.file > new.file
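
Real mail files often carry attributes on the opening tag, for example <html lang="en">, which the pattern <html> would miss. The sketch below is one way to handle that with awk, writing the result to a new file. The file names message.txt and message.html, the sample content, and the variable extracted are all hypothetical.

```shell
workdir=$(mktemp -d)
printf 'From: test user <test@test.com>\nMessage-ID: <test@test.com>\n\n<HTML lang="en">\n<body>Hi</body>\n</HTML>\n\n--boundary--\n' > "$workdir/message.txt"

# Range of two expressions: start at a line containing <html followed by a
# space or >, stop at </html>; tolower() makes both tests case-insensitive.
awk 'tolower($0) ~ /<html[ >]/, tolower($0) ~ /<\/html>/' \
    "$workdir/message.txt" > "$workdir/message.html"

extracted=$(cat "$workdir/message.html")
printf '%s\n' "$extracted"

rm -rf "$workdir"
```

The output file must be different from the input file; redirecting to the file being read would truncate it before awk sees its contents.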


Related Solutions

Ubuntu – Print only the first match once

Try this,

for i in $(cat ~/jlog/"$2"); do
        grep "$1" ~/jlog/"$2" |
        awk '/\([a-zA-Z0-9.]+/ {print $7; exit}' 
done;

exit in the awk command exits after printing the first match.

Alternatively, pipe the output of the for loop into the following awk command:

for .... | awk -F'[(/]' '{print $2;exit}'
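
The effect of exit can be checked in isolation. The following is a minimal sketch with a made-up three-line log; the content and the variable name first are assumptions, not the asker's actual log format.

```shell
log=$(mktemp)
printf 'alpha one\nbeta two\nalpha three\n' > "$log"

# awk filters and prints in one step; exit stops it right after the
# first matching line, so the later alpha line is never reached.
first=$(awk '/alpha/ {print $2; exit}' "$log")
printf '%s\n' "$first"

rm -f "$log"
```

Because awk can do the matching itself, neither the surrounding loop nor the separate grep is strictly needed for this kind of task.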

Ubuntu – How to find all patterns between two characters

First of all, your grep -Po '"\K[^"]*' file idea fails because grep sees both "One" and ". the second is here" as being inside quotes. Personally, I'd probably just do

$ grep -oP '"[^"]+"' file | tr -d '"'
One
Two 
 Three 
Four

But that is two commands. To do it with a single command, you could use one of:

Perl
```
$ perl -lne '@F=/"\s*([^"]+)\s*"/g; print for @F' file 
One
Two 
Three 
Four
```
Here, the @F array holds all matches of the regex (a quote, optional whitespace, then as many non-" characters as possible until the next "). The print for @F just means "print each element of @F".

Perl

$ perl -F'"' -lne 'for($i=1;$i<=$#F;$i+=2){print $F[$i]}' file 
One
Two 
 Three 
Four

To remove leading/trailing spaces from each match, use this:

perl -F'"' -lne 'for($i=1;$i<=$#F;$i+=2){$F[$i]=~s/^\s+|\s+$//g; print $F[$i]}' file

Here, Perl is behaving like awk. The -a switch (implied by -F in Perl 5.20 and later, which is why it does not appear on the command line) causes it to automatically split input lines into fields on the character given by -F. Since I have given it ", the fields are:

$ perl -F'"' -lne 'for($i=0;$i<=$#F;$i++){print "Field $i: $F[$i]"}' file 
Field 0: first matched is 
Field 1: One
Field 2: . the second is here
Field 3: Two 
Field 0: and here are in second line
Field 1:  Three 
Field 2: 
Field 3: Four
Field 4: .

Because we are looking for text between two consecutive field separators, we know we want every second field. So, for($i=1;$i<=$#F;$i+=2){print $F[$i]} will print the ones we care about.

The same idea but in awk:

$ awk -F'"' '{for(i=2;i<=NF;i+=2){print $(i)}}' file 
One
Two 
 Three 
Four
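
The trimming step can be folded into the awk version too. This is only a sketch: the two input lines are reconstructed from the field listing above, and the names input, field and quoted are assumptions of the example.

```shell
input=$(mktemp)
printf '%s\n' 'first matched is "One". the second is here"Two "' \
              'and here are in second line" Three ""Four".' > "$input"

# Every even-numbered field lies between a pair of quotes; trim blanks
# at both ends and skip fields that end up empty.
quoted=$(awk -F'"' '{
  for (i = 2; i <= NF; i += 2) {
    field = $i
    gsub(/^[ \t]+|[ \t]+$/, "", field)
    if (field != "") print field
  }
}' "$input")
printf '%s\n' "$quoted"

rm -f "$input"
```

Copying $i into field before calling gsub leaves the record itself unmodified, which keeps the field numbering stable inside the loop.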
