Grep regular expression solution (greedy not working)

command line, grep, regular expression, text processing

I have the following text in the data.txt file

:MENU1
0. public
1. admin
2. webmail

:SYNTAX
! opt1, ... :

:ERROR1
Error #1, blah... blah.. blah...
Please do ...

:ERROR2
Error #2 ...

I want to use a regular expression (Perl syntax) to extract the part from :MENU1 up to the next :, dropping :MENU1 and that closing : from the result.

I have tried several regexes, but with the closest one I found
I can't get the 'greedy' option to work and can't discard the last ":"

grep -Poz "^:MENU1\K[\w\W]*:"

This works with grep, but it returns all the text up to the last ":".
I want only the text up to the first ":" after :MENU1:

0. public
1. admin
2. webmail

(note the final blank line)

Best Answer

The pattern *: is greedy and will match everything up to the last : in the input. To stop at the next : you need the lazy form *?:. E.g.:

% grep -Poz '^:MENU1\K[\w\W]*?:' data.txt 

0. public
1. admin
2. webmail

:

You can strip the leading blank line from the output by matching the newline before your \K. E.g.:

% grep -Poz '^:MENU1\n\K[\w\W]*?:' data.txt 
0. public
1. admin
2. webmail

:

To drop the trailing empty line and the :, match them inside a look-ahead so that they are required but not included in the output. E.g.:

% grep -Poz '^:MENU1\n\K[\w\W]*?(?=\n+:)' data.txt 
0. public
1. admin
2. webmail

Next, we can simplify your character class to match anything but a colon:

% grep -Poz '^:MENU1\n\K[^:]*?(?=\n+:)' data.txt 
0. public
1. admin
2. webmail

And finally we can rewrite the initial part of the match:

% grep -Poz '(?<=:MENU1\n)[^:]*?(?=\n+:)' data.txt 
0. public
1. admin
2. webmail

This is similar to what @terdon came up with, but this takes care of the blank lines without another call to grep.

This final regex makes use of look-around assertions. The (?<=pattern) is a look-behind assertion that lets you match the pattern but not include it as part of the output. The (?=pattern) is a look-ahead assertion and lets us match on the trailing pattern without including it in the output.
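
As a self-contained check of the final command, the sketch below recreates a trimmed copy of the question's sample in a temporary file and runs the look-around pattern against it. The temporary path and the variable names are assumptions for illustration, not part of the original question. It assumes GNU grep built with PCRE support (-P); the tr call strips the NUL byte that -z appends to each match.

```shell
# Recreate part of the sample input in a temporary file (hypothetical path).
data=$(mktemp)
cat > "$data" <<'EOF'
:MENU1
0. public
1. admin
2. webmail

:SYNTAX
! opt1, ... :

:ERROR1
Error #1, blah... blah.. blah...
EOF

# (?<=:MENU1\n) requires ":MENU1" plus a newline immediately before the match.
# [^:]*?        lazily consumes characters that are not a colon.
# (?=\n+:)      stops before the newline(s) that precede the next ":" line.
menu=$(grep -Poz '(?<=:MENU1\n)[^:]*?(?=\n+:)' "$data" | tr -d '\0')

printf '%s\n' "$menu"
rm -f "$data"
```

Because both assertions are zero-width, neither the :MENU1 header nor the blank line before :SYNTAX ends up in the captured text.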

Related Solutions

Greedy and lazy regular expressions (comprehension question)

A lazy match is not the shortest possible match, just a shorter one. From a given starting position, greedy mode takes the last possible ending and lazy mode the first possible ending. But the first possible ending does not necessarily give the shortest match overall.

Take the input string foobarbaz and the regexp o.*a (greedy) or o.*?a (lazy).

The shortest possible match in this input string would be oba.

However, the regex engine looks for matches from left to right, so the o in the pattern matches the first o in foobarbaz. If the rest of the pattern can produce a match from there, that is where the match starts.

Following the first o, .* (greedy) eats obarbaz (the rest of the string) and then backtracks in order to match the rest of the pattern (a). Thus it finds the last a, in baz, and ends up matching oobarba.

Following the first o, .*? (lazy) doesn't eat the rest of the string; instead it looks for the first occurrence of the rest of the pattern. So first it sees the second o, which doesn't match a, then it sees b, which doesn't match a, then it sees a, which matches a, and because it's lazy that's where it stops. The result is ooba, not oba.

So while it's not THE shortest possible one, it's a shorter one than the greedy version.
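
The foobarbaz walk-through can be reproduced from the shell. This is a minimal sketch assuming GNU grep with PCRE support (-P); -o prints only the matched text, and the variable names are illustrative.

```shell
input=foobarbaz

# Greedy: .* runs to the end of the line, then backtracks to the last "a".
greedy=$(printf '%s\n' "$input" | grep -oP 'o.*a')

# Lazy: .*? grows one character at a time and stops at the first "a".
lazy=$(printf '%s\n' "$input" | grep -oP 'o.*?a')

printf 'greedy: %s\nlazy:   %s\n' "$greedy" "$lazy"
# greedy: oobarba
# lazy:   ooba
```

Both matches start at the first o, because the engine commits to the leftmost starting position before the quantifier's greediness comes into play.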

Grep – How to Find Closing Bracket

This sed script prints the line number of the line matching /^};/ in the range of lines from /xkb_symbols "dvorak" {/ to the next /^};/ (which will be the same }; as the one we get the line number for):

/xkb_symbols "dvorak" {/,/^};/{
        /^};/=
}

If you need both start and end line numbers:

/xkb_symbols "dvorak" {/,/^};/{
        /xkb_symbols "dvorak" {/=
        /^};/=
}

$ sed -n -f tiny_script.sed /usr/share/X11/xkb/symbols/us
192
248

Alternatively:

$ sed -n -f - /usr/share/X11/xkb/symbols/us <<END_SED
/xkb_symbols "dvorak" {/,/^};/{
        /xkb_symbols "dvorak" {/=
        /^};/=
}
END_SED

EDIT: To get these two numbers in a variable, assuming you're using Bash:

pos=( $( sed -n -f - /usr/share/X11/xkb/symbols/us <<END_SED
        /xkb_symbols "dvorak" {/,/^};/{
                /xkb_symbols "dvorak" {/=
                /^};/=
        }
END_SED
) )

echo "start = ${pos[0]}"
echo "end   = ${pos[1]}"
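
To try the same range-plus-= technique without the real keymap file, the sketch below runs it against a small made-up file. The file contents, the temporary path, and the variable names are assumptions for illustration only, and the sed script is passed inline rather than through -f.

```shell
# Hypothetical stand-in for /usr/share/X11/xkb/symbols/us.
sample=$(mktemp)
cat > "$sample" <<'EOF'
xkb_symbols "basic" {
    name[Group1]= "English (US)";
};

xkb_symbols "dvorak" {
    name[Group1]= "English (Dvorak)";
    key <AD01> { [ apostrophe, quotedbl ] };
};
EOF

# Within the range from the "dvorak" header to the next line starting
# with "};", "=" prints the line number of the header and of the closer.
# -n suppresses all other output.
lines=$(sed -n '/xkb_symbols "dvorak" {/,/^};/{
        /xkb_symbols "dvorak" {/=
        /^};/=
}' "$sample")

# Split the two numbers into positional parameters.
set -- $lines
start=$1
end=$2

echo "start = $start"
echo "end   = $end"
rm -f "$sample"
```

The indented line ending in }; inside the dvorak block does not close the range, because the end address is anchored with ^.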

Also, hi! Another Dvorak user!
