Bash – Extract Content Between Two Match Patterns

Tags: awk, bash, grep, scripts

I have a file that contains several kinds of text, and my goal is to extract only the HTML part and save it to a new file. I think this is possible with grep or awk. My file also contains lines like this:

Sender name `<test@email.com>`

I tried the command cat file1.html | grep -E "<[^>]*>", but it also prints lines such as the Sender name line above. I want to extract only the content after the <html> tag, so output like this is not useful to me:

Return-Path: <test@test.com>
    for <test@localhost> (single-drop); Thu, 21 Sep 2017 18:34:07 +0400 (+04)
Return-path: <test@test.com>
    (envelope-from <test@test.com>)
References: <test@test.com>
From: test user <test@test.com>
X-Forwarded-Message-Id: <test@test.com>
Message-ID: <test@test.com>
In-Reply-To: <test@test.com>

Best Answer

We can achieve this with sed, the stream editor for filtering and transforming text. The short answer is given under point 5 below, but I've decided to write a detailed explanation.

0. First let's create a simple file to test our commands:

$ printf '\nTop text\nSender <example@email.com>\n\n<html>\n\tThe inner text 1\n</html>\n\nMiddle text\n\n<HTML>\n\tThe inner text 2\n</HTML>\n\nBottom text\n' | tee example.file

Top text
Sender <example@email.com>

<html>
        The inner text 1
</html>

Middle text

<HTML>
        The inner text 2
</HTML>

Bottom text

1. We can crop everything between the tags <html> and </html>, including them, in this way:

$ sed -n -e '/<html>/,/<\/html>/p' example.file

<html>
        The inner text 1
</html>
  • The option -e script (--expression=script) adds a script to the commands to be executed. In this case the script is '/<html>/,/<\/html>/p'. Since we have only one script, we can omit this option.

  • The option -n (--quiet, --silent) suppresses automatic printing of the pattern space, so along with this option we must use an additional command to tell sed what to print.

  • This additional command is the print command p, added to the end of the script. If sed were started without the -n option, every line in the range would be printed twice and every other line once.

  • Finally, the two comma-separated patterns, /<html>/,/<\/html>/, specify a range. Note that we use \ to escape the special character /, which plays the role of the delimiter here.
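
As a side note to the escaping mentioned in the last bullet, here is a minimal sketch of the same range written with a different address delimiter, so that the / in </html> needs no backslash. It assumes the example.file from point 0 (recreated here so the snippet runs on its own); the choice of # as delimiter is arbitrary, and the variable name result is only for illustration.

```shell
# Recreate the sample file from point 0 so this snippet is self-contained.
printf '\nTop text\nSender <example@email.com>\n\n<html>\n\tThe inner text 1\n</html>\n\nMiddle text\n\n<HTML>\n\tThe inner text 2\n</HTML>\n\nBottom text\n' > example.file

# \#...# starts an address that uses # instead of / as the delimiter,
# so </html> can be written literally. -e is omitted (single script).
result=$(sed -n '\#<html>#,\#</html>#p' example.file)
printf '%s\n' "$result"
```

This should print the same three lines as the command in point 1.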

2. If we want to crop everything between the tags <html> and </html>, without printing them, we should add some additional commands:

$ sed -n '/<html>/,/<\/html>/{ /html>/d; p }' example.file

        The inner text 1
  • The curly braces, { and }, are used to group the commands.

  • The command d deletes each line that matches the expression html>, which covers both the opening and the closing tag. A deleted line never reaches the p command.
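
One caveat: the expression html> also matches any body line that merely contains that string. The following is a sketch of a stricter variant, assuming the tags stand alone on their lines as in example.file. The file sample2.file and its contents are made up for this illustration.

```shell
# Hypothetical sample: one body line happens to contain "html>".
printf 'Header\n<html>\n\tThe inner text 1\n\tSee the xhtml> note\n</html>\nFooter\n' > sample2.file

# Delete only lines that consist of exactly <html> or </html>.
# \{0,1\} makes the preceding / optional, ^ and $ anchor the whole line.
result=$(sed -n '/<html>/,/<\/html>/{ /^<\/\{0,1\}html>$/d; p; }' sample2.file)
printf '%s\n' "$result"
```

With the looser /html>/d from point 2, the "xhtml>" line would be dropped as well; the anchored pattern keeps it.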

3. However, our example.file also has upper case <HTML> tags, so we should make the pattern match case-insensitive. We can do that by adding the I flag to the regular expressions:

$ sed -n '/<html>/I,/<\/html>/I{ /html>/Id; p }' example.file

        The inner text 1
        The inner text 2
  • The I modifier to regular-expression matching is a GNU extension which causes the REGEXP to be matched in a case-insensitive manner.
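
Because the I modifier is a GNU extension, it may not be available in every sed. A possible workaround, sketched here under the assumption that only the letters of the tag name vary in case, is to spell out both cases with bracket expressions. The sample file is the one from point 0, recreated so the snippet stands alone.

```shell
# Recreate the sample file from point 0 so this snippet is self-contained.
printf '\nTop text\nSender <example@email.com>\n\n<html>\n\tThe inner text 1\n</html>\n\nMiddle text\n\n<HTML>\n\tThe inner text 2\n</HTML>\n\nBottom text\n' > example.file

# [Hh][Tt][Mm][Ll] matches html in any mix of upper and lower case
# without relying on the GNU-only I flag.
result=$(sed -n '/<[Hh][Tt][Mm][Ll]>/,/<\/[Hh][Tt][Mm][Ll]>/{ /[Hh][Tt][Mm][Ll]>/d; p; }' example.file)
printf '%s\n' "$result"
```

This is more verbose than the I flag, but it should print the same two inner lines as the command in point 3.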

4. If we want to remove all HTML tags between the <html> tags we could add an additional command, that will parse and 'delete' the strings, which begin with < and end with >:

$ sed -n '/<html>/I,/<\/html>/I{ /html>/Id; s/<[^>]*>//g; p }' example.file
  • The command s substitutes each string that matches the expression <[^>]*> with an empty string. Its general form is s/old/new/.

  • The pattern flag g will apply the replacement to all matches to the regexp, not just the first.

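Since the s command already strips the <html> tags themselves, we may want to omit the delete command in this case. The lines that held only a tag are then printed as empty lines:

$ sed -n '/<html>/I,/<\/html>/I{ s/<[^>]*>//g; p }' example.file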
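
The example.file from point 0 has no tags inside the <html> block, so the effect of the s command is hard to see there. Here is a small sketch with a made-up file, tagged.file, that compares both variants of point 4. It assumes GNU sed because of the I flag; the variable names are only for illustration.

```shell
# Hypothetical sample with nested tags inside the <html> block.
printf 'From: user <user@example.com>\n<html>\n<p>Hello <b>world</b></p>\n</html>\nFooter\n' > tagged.file

# Variant with the delete command: the tag-only lines are dropped.
with_delete=$(sed -n '/<html>/I,/<\/html>/I{ /html>/Id; s/<[^>]*>//g; p }' tagged.file)
printf '%s\n' "$with_delete"

# Variant without it: the tag-only lines are emptied but still printed,
# so we count the blank lines in the output.
blank_lines=$(sed -n '/<html>/I,/<\/html>/I{ s/<[^>]*>//g; p }' tagged.file | grep -c '^$')
printf 'blank lines without d: %s\n' "$blank_lines"
```

Note that the From: line is left alone in both cases, because it lies outside the range.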

5. To change the file in place and keep a backup copy we can use the option -i with a backup suffix, or we can create a new file from sed's output by redirecting it with > to the new file:

$ sed -i.bak -n '/<html>/I,/<\/html>/I p' example.file
$ sed -n '/<html>/I,/<\/html>/I p' example.file > new.file
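
To see what the two commands leave behind, here is a short sketch that runs both on the sample file from point 0 inside a scratch directory. It assumes GNU sed (for the I flag and for -i with a suffix); the use of mktemp and the checks at the end are additions for this illustration only.

```shell
# Work in a scratch directory so no existing file is overwritten.
cd "$(mktemp -d)" || exit 1
printf '\nTop text\nSender <example@email.com>\n\n<html>\n\tThe inner text 1\n</html>\n\nMiddle text\n\n<HTML>\n\tThe inner text 2\n</HTML>\n\nBottom text\n' > example.file

# Redirect first: example.file is untouched, new.file gets the HTML part.
sed -n '/<html>/I,/<\/html>/I p' example.file > new.file

# In place: example.file now holds only the HTML part,
# and the original content is kept in example.file.bak.
sed -i.bak -n '/<html>/I,/<\/html>/I p' example.file

cmp -s example.file new.file && echo "example.file and new.file are identical"
grep -c '' example.file.bak   # number of lines kept in the backup
```

The redirect is the safer choice when the original file must stay as it is, since -i replaces its content.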
