Multiple patterns with sed (regex AND or condition)

grep, sed, text processing

I want to remove the unwanted data from the input below.
My question is: how do I delete a line unless it contains test1 OR ends with a quote?

20  /test1/catergory="Food"
20  /test1/target="Adults, \"Goblins\", Elderly,
Babies, \"Witch\",
Faries"
**This is some unwanted data to remove**
20  /test1/type="Western"
20  /test1/end=category
**This is some unwanted data to remove**
20  /test1/Purpose=
20  /test1/my_purpose="To create 
a fun-filled moment"
20  /test1/end=Purpose

Expected output:

20  /test1/catergory="Food"
20  /test1/target="Adults, \"Goblins\", Elderly,
Babies, \"Witch\",
Faries"
20  /test1/type="Western"
20  /test1/end=category
20  /test1/Purpose=
20  /test1/my_purpose="To create 
a fun-filled moment"
20  /test1/end=Purpose

I am stuck with these commands:

1. grep -B1 'test1' test_long_sentence.txt
2. sed '/test1/!d' test_long_sentence.txt 
3. sed '/\"$/!d' test_long_sentence.txt

I do not know how to combine no. 2 and no. 3 (sed with multiple commands, joined by a regex OR condition).
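For reference, a minimal sketch of the literal combination of commands 2 and 3: a line is deleted only when it fails both tests. The function name keep_either is hypothetical. Note that this alone does not produce the expected output, because a continuation line such as `Babies, \"Witch\",` neither contains test1 nor ends with a quote.

```shell
#!/bin/sh
# keep_either is a hypothetical helper name, not part of the original post.
# Keep a line if it contains "test1" OR ends with a double quote;
# delete it only when BOTH patterns fail to match.
keep_either() {
    sed '/test1/!{/"$/!d;}' "$@"
}
```

It could be called as `keep_either test_long_sentence.txt`, or fed from a pipe.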

Best Answer

lex (or flex on Linux systems) is a program that takes a scanner/lexer specification and turns it into a C program. Its scanner specification is similar in nature to an awk program, but where awk is record-oriented, lex is "character-oriented".

Using lex with the following source in lexer.l:

%x OUTPUT
%%
                        int quoted = 0;

^[0-9]*[ \t]*"/test1/"  { BEGIN OUTPUT;             ECHO; }
<OUTPUT>\n              { if (!quoted) { BEGIN 0; } ECHO; }
<OUTPUT>[^\\]["]        { quoted = !quoted;         ECHO; }
<OUTPUT>.               {                           ECHO; }
.|\n                    ;

This scanner uses an OUTPUT state to keep track of whether the current characters should be output or not. We enter this state with BEGIN OUTPUT when we find a line that looks like

<number>  /test1/

(this is handled by the first rule). We exit this state when a line ends and we're not currently scanning a quoted string (this is handled by the second rule).

A quoted string is started and ended with an un-escaped " character (the third rule). All other characters are passed through as is without action (the fourth rule).

While not in the OUTPUT state, we ignore everything (the last rule).

Note that this is a makeshift scanner written for your particular data. It does not handle quoted strings that end with an escaped backslash ("some data \\"), but it works on the data that you have shown.

Building it:

$ make lexer
lex  -o lex.lexer.c lexer.l
cc -O2 -pipe    -o lexer lex.lexer.c  -ll
rm -f lex.lexer.c

(on Linux, when using flex, you may have to use make lexer LDLIBS=-lfl)

Using it:

$ ./lexer <file
20  /test1/catergory="Food"
20  /test1/target="Adults, \"Goblins\", Elderly,
Babies, \"Witch\",
Faries"
20  /test1/type="Western"
20  /test1/end=category
20  /test1/Purpose=
20  /test1/my_purpose="To create
a fun-filled moment"
20  /test1/end=Purpose
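
For readers without lex at hand, the state tracking used by the scanner above can be sketched in awk, working line by line instead of character by character. This is only an approximation under the same assumptions about the data, not a translation of the lex rules; the function name filter_test1 and the variables out and quoted are chosen here for illustration.

```shell
#!/bin/sh
# filter_test1 is a hypothetical helper name, not part of the original answer.
filter_test1() {
    awk '
        # Start printing at a line that looks like "<number>  /test1/...".
        !out && /^[0-9]*[ \t]*\/test1\// { out = 1 }

        out {
            print
            s = $0
            gsub(/\\./, "", s)           # drop escaped characters such as \"
            if (gsub(/"/, "", s) % 2)    # odd number of bare quotes on the line
                quoted = !quoted         # so a quoted string opened or closed
            if (!quoted)                 # line ended outside a quoted string:
                out = 0                  # stop printing until the next match
        }
    ' "$@"
}
```

It can be used as `filter_test1 <file`. Because escaped characters are removed before the quotes are counted, a value ending in `\\"` is treated as closed, which differs from the lex rules shown earlier.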

Related Solutions

Grep – Find Multiple AND Patterns in Any Order

If your version of grep supports PCRE (GNU grep does this with the -P or --perl-regexp option), you can use lookaheads to match multiple words in any order:

grep -P '(?=.*?word1)(?=.*?word2)(?=.*?word3)^.*$'

This won't highlight the words, though. Lookaheads are zero-length assertions, so they're not part of the matching sequence.

I think your piping solution should work for that. By default, grep only colors the output when it's going to a terminal, so only the last command in the pipeline does highlighting, but you can override this with --color=always.

grep --color=always foo | grep --color=always bar
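
Where grep lacks PCRE support, the same "all words, any order" match can be sketched with awk, which lets several patterns be joined with &&. This is an alternative technique rather than the lookahead approach itself; the function name match_all is hypothetical, and word1, word2 and word3 are the placeholder words from the grep command above.

```shell
#!/bin/sh
# match_all is a hypothetical helper name used for illustration.
# Print only the lines that contain word1, word2 and word3, in any order.
match_all() {
    awk '/word1/ && /word2/ && /word3/' "$@"
}
```

As with the lookahead version, this selects lines without highlighting the matched words.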

Sed command that would ignore any commented match

You should not believe them if they tell you it cannot be done. You should believe them, however, if they tell you it's not easy.

sed '\|*/|!{ s|/\*|\n&|              #if ! */ repl 1st /* w/ \n/*
     h;      s|foo|bar|g;/\n/!b      #hold; repl all foo/bar; if ! \n branch
     G;      s|\n.*\n||;:n           #Get; clear difference; :new label
     n;      \|*/|!bn;s|^|\n/*|      #new line; if ! */ branch new label
     };s|*/|\n&|g                    #repl all */ w/ \n*/
       s|foo|&\nbar|g;:r             #repl all foo w/ foo\nbar
       s|\(/\*[^\n]*\)\nbar|\1|g;tr  #repl all /*[^\n]*\nbar w/ foo
       s|foo\n\(b\)|\1|g             #repl all foo\nbar w/ bar
       s|^\n/.||;s|\n||g             #clear any \n inserts
'    <<\INPUT
asfoo   /* asdfooasdfoo


asdfasdfoo
asdfasdfoo
foo */foo /*foo*/ foo
/*.
foo*/
foo
hello

INPUT

OUTPUT

asbar   /* asdfooasdfoo


asdfasdfoo
asdfasdfoo
foo */bar /*foo*/ bar
/*.
foo*/
bar
hello
