Multiple patterns with sed (regex AND or condition)

grep, sed, text processing

I want to remove the unwanted data from the file below.
My question is: how do I delete a line above a test1 line if it neither contains test1 nor ends with a quote?

20  /test1/catergory="Food"
20  /test1/target="Adults, \"Goblins\", Elderly,
Babies, \"Witch\",
Faries"
**This is some unwanted data to remove**
20  /test1/type="Western"
20  /test1/end=category
**This is some unwanted data to remove**
20  /test1/Purpose=
20  /test1/my_purpose="To create 
a fun-filled moment"
20  /test1/end=Purpose

Expected output:

20  /test1/catergory="Food"
20  /test1/target="Adults, \"Goblins\", Elderly,
Babies, \"Witch\",
Faries"
20  /test1/type="Western"
20  /test1/end=category
20  /test1/Purpose=
20  /test1/my_purpose="To create 
a fun-filled moment"
20  /test1/end=Purpose

I am stuck with these commands:

1. grep -B1 'test1' test_long_sentence.txt
2. sed '/test1/!d' test_long_sentence.txt 
3. sed '/\"$/!d' test_long_sentence.txt

I do not know how to combine commands 2 and 3 (a single sed invocation that applies both regexes with an OR condition).

Best Answer

lex (or flex on Linux systems) is a program that takes a scanner/lexer specification and turns it into a C program. Its scanner specification is similar in nature to an awk program, but where awk is record-oriented, lex is "character-oriented".

Using lex with the following source in lexer.l:

%x OUTPUT
%%
                        int quoted = 0;

^[0-9]*[ \t]*"/test1/"  { BEGIN OUTPUT;             ECHO; }
<OUTPUT>\n              { if (!quoted) { BEGIN 0; } ECHO; }
<OUTPUT>[^\\]["]        { quoted = !quoted;         ECHO; }
<OUTPUT>.               {                           ECHO; }
.|\n                    ;

This scanner uses an OUTPUT state to keep track of whether the current characters should be output or not. We enter this state with BEGIN OUTPUT when we find a line that looks like

<number>  /test1/

(this is handled by the first rule). We exit this state when a line ends and we're not currently scanning a quoted string (this is handled by the second rule).

A quoted string is started and ended by an unescaped " character, that is, a " preceded by any character other than a backslash (the third rule). All other characters are passed through as is, without further action (the fourth rule).

While not in the OUTPUT state, we ignore everything (the last rule).

Note that this is a makeshift scanner written for your particular data. It does not handle quoted strings that end with an escaped backslash ("some data \\"), because the closing quote is then preceded by a backslash and is not recognized as a terminator, but it works on the data that you have shown.

Building it:

$ make lexer
lex  -o lex.lexer.c lexer.l
cc -O2 -pipe    -o lexer lex.lexer.c  -ll
rm -f lex.lexer.c

(on Linux, when using flex, you may have to use make lexer LDLIBS=-ll, or LDLIBS=-lfl if the flex library is installed as libfl)

Using it:

$ ./lexer <file
20  /test1/catergory="Food"
20  /test1/target="Adults, \"Goblins\", Elderly,
Babies, \"Witch\",
Faries"
20  /test1/type="Western"
20  /test1/end=category
20  /test1/Purpose=
20  /test1/my_purpose="To create
a fun-filled moment"
20  /test1/end=Purpose
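
For comparison, the same two-state idea can be sketched in awk, which may be closer to the grep/sed tools used in the question. This is only a minimal sketch and not part of the scanner above: the function name keep_lines is hypothetical, and it assumes that a value never opens more than one unterminated quoted string per line. Instead of matching a quote preceded by a non-backslash, it strips every backslash-escaped character before counting quotes, so a value ending in an escaped backslash should also be handled.

```shell
#!/bin/sh
# Sketch of the same state machine in awk (keep_lines is a hypothetical name).
# State: "quoted" is 1 while a double-quoted value is still open.
keep_lines() {
    awk '
    {
        # Outside a quoted value, keep only records that look like
        # "<number>  /test1/..."; everything else is dropped.
        if (!quoted && $0 !~ /^[0-9]*[ \t]*\/test1\//) next

        print

        # Work on a copy of the record: remove backslash-escaped characters
        # first, then count the double quotes that remain.
        line = $0
        gsub(/\\./, "", line)
        n = gsub(/"/, "", line)

        # An odd number of unescaped quotes opens or closes a quoted value.
        if (n % 2) quoted = !quoted
    }' "$@"
}
```

With the function defined in the current shell, it could be run as keep_lines test_long_sentence.txt (the file name from the question) or fed from standard input.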