I have a file that contains several kinds of text, and my goal is to extract only the HTML part and create a new file with that HTML code. I think this is possible with `grep` or `awk`. My file also contains lines like this:
Sender name `<test@email.com>`
I wrote this command: `cat file1.html | grep -E "<[^>]*>"`. The problem is that it also outputs lines such as `Sender name`, etc. I want to extract only the content after the `<html>` tag. So output like this is not useful for me:
Return-Path: <test@test.com>
for <test@localhost> (single-drop); Thu, 21 Sep 2017 18:34:07 +0400 (+04)
Return-path: <test@test.com>
(envelope-from <test@test.com>)
References: <test@test.com>
From: test user <test@test.com>
X-Forwarded-Message-Id: <test@test.com>
Message-ID: <test@test.com>
In-Reply-To: <test@test.com>
Best Answer
We can achieve this goal with the tool `sed`, a stream editor for filtering and transforming text. The short answer is given under point 5 below, but I've decided to write a detailed explanation.

0. First let's create a simple file to test our commands:
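A minimal sketch of such a test file. Its contents are hypothetical (a few mail header lines like those in the question, plus one lower case and one upper case HTML block); only the name `example.file` comes from the points below:

```shell
# Hypothetical sample: mail headers mixed with a lower case and an
# upper case HTML block. The quoted 'EOF' keeps the shell from
# expanding anything inside the here-document.
cat > example.file << 'EOF'
Return-Path: <test@test.com>
From: test user <test@test.com>
<html>
<body><p>Some <b>text</b> here.</p></body>
</html>
Message-ID: <test@test.com>
<HTML>
<BODY><P>More text.</P></BODY>
</HTML>
In-Reply-To: <test@test.com>
EOF
```

The header lines contain `<...>` too, which is exactly why the plain `grep -E "<[^>]*>"` from the question matches more than the HTML.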
1. We can crop everything between the tags `<html>` and `</html>`, including the tags themselves, with `sed -n -e '/<html>/,/<\/html>/p' example.file`.

The option `-e script` (`--expression=script`) adds a script to the commands to be executed. In this case the script that is added is `'/<html>/,/<\/html>/p'`. As long as we have only one script, we can omit this option.

The option `-n` (`--quiet`, `--silent`) suppresses automatic printing of the pattern space, so along with this option we should use some additional command(s) to tell `sed` what to print.

This additional command is the print command `p`, added to the end of the script. If `sed` isn't started with the `-n` option, the `p` command duplicates the matching lines in the output, because each of them is printed twice.

Finally, the two comma-separated patterns, `/<html>/,/<\/html>/`, specify a range. Please note that we are using `\` to escape the special character `/`, which plays the role of delimiter here.

2. If we want to crop everything between the tags `<html>` and `</html>` without printing the tags themselves, we should add some additional commands to the script.

The curly braces, `{` and `}`, are used to group the commands.

The command `d` deletes each line that matches the expression `html>`, which covers both the opening and the closing tag line.

3. But our `example.file` also has upper case `<HTML>` tags, so we should make the pattern match case insensitive. We can do that by adding the flag `I` to the regular expressions, as in `/<html>/I`.

The `I` modifier to regular-expression matching is a GNU extension which causes the REGEXP to be matched in a case-insensitive manner.

4. If we want to remove all HTML tags between the `<html>` tags, we can add an additional command that parses and 'deletes' the strings which begin with `<` and end with `>`.

The command `s` substitutes the strings that match the expression `/<[^>]*>/` with an empty string `//`, following the form `s/<old>/<new>/`.

The pattern flag `g` applies the replacement to all matches of the regexp, not just the first.

Probably we would want to omit the delete command in this case:
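The commands for points 1 to 4 could look as follows, ending with the variant that omits the delete command. This is a sketch assembled from the explanations above. It assumes GNU `sed` (needed for the `I` flag) and recreates a small hypothetical `example.file` so that it can be run on its own:

```shell
# Recreate a small hypothetical sample so this block runs on its own.
printf '%s\n' \
  'From: test user <test@test.com>' \
  '<html>' \
  '<p>Some <b>text</b></p>' \
  '</html>' \
  'Message-ID: <test@test.com>' \
  '<HTML>' \
  '<P>More text</P>' \
  '</HTML>' > example.file

# 1. Print the lower case block, tags included.
sed -n -e '/<html>/,/<\/html>/p' example.file

# 2. Same range, but delete the lines that hold the tags themselves
#    before the remaining lines are printed.
sed -n '/<html>/,/<\/html>/{ /html>/d; p; }' example.file

# 3. Case-insensitive matching (GNU extension), so the <HTML> block
#    is found as well.
sed -n '/<html>/I,/<\/html>/I{ /html>/Id; p; }' example.file

# 4. Additionally strip every <...> tag inside the range.
sed -n '/<html>/I,/<\/html>/I{ /html>/Id; s/<[^>]*>//g; p; }' example.file

# 4, without the delete command: the tag lines are not dropped, they
#    are emptied by the substitution and printed as blank lines.
sed -n '/<html>/I,/<\/html>/I{ s/<[^>]*>//g; p; }' example.file
```

Note that the range is line based: a file that has `<html>` and `</html>` on the same line would need a different approach.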
5. To make the changes in place and create a backup copy, we can use the option `-i`, or we can create a new file from `sed`'s output by redirecting (`>`) the output to the new file:
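Both variants might look like this. It is a sketch that assumes GNU `sed`; the backup suffix `.bak` and the output name `new.file` are assumptions, and the script used is a case-insensitive range that keeps the tags, since the goal in the question is a file with the HTML code:

```shell
# Recreate a small hypothetical sample so this block runs on its own.
printf '%s\n' \
  'From: test user <test@test.com>' \
  '<html>' \
  '<p>Some <b>text</b></p>' \
  '</html>' \
  'Message-ID: <test@test.com>' \
  '<HTML>' \
  '<P>More text</P>' \
  '</HTML>' > example.file

# Redirect: example.file stays untouched, the HTML goes to new.file.
sed -n '/<html>/I,/<\/html>/Ip' example.file > new.file

# In place: example.file is overwritten with the HTML only, and the
# original content is kept as example.file.bak.
sed -i.bak -n '/<html>/I,/<\/html>/Ip' example.file
```

Combining `-n` with `-i` means that only the lines printed by `p` end up in the file, so keeping the backup suffix is a sensible precaution.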