Why does BSD grep on macOS produce only the first word here:
$ echo "once upon a time" | grep -o "[a-z]*"
once
but all words here:
$ echo "once upon a time" | grep -o "[a-z][a-z]*"
once
upon
a
time
Or, using extended regular expressions:
$ echo "once upon a time" | grep -E -o "[a-z]*"
once
$ echo "once upon a time" | grep -E -o "[a-z]+"
once
upon
a
time
GNU grep will produce identical output for both [a-z]+ (or [a-z][a-z]*) and [a-z]*:
$ echo "once upon a time" | ggrep -E -o "[a-z]*"
once
upon
a
time
$ echo "once upon a time" | ggrep -E -o "[a-z]+"
once
upon
a
time
Best Answer
Collecting the thoughts of the comment section, it seems this comes down to how different grep implementations have decided to deal with empty matches, and the [a-z]* expression matches the empty string.

The -o option is not defined by POSIX, so how an implementation deals with it is left to the developers.

GNU grep evidently throws away empty matches, for example the match of the empty string after once when using [a-z]*, and continues to process the input from the next character onwards.

BSD grep seems to hit the empty match, decide that, for whatever reason, that's enough, and stop there.

Stéphane mentions that the ast-open version of grep actually goes into an infinite loop at the empty match of [a-z]* after once and doesn't get past that point in the string.

OpenBSD grep seems to be different from macOS and FreeBSD grep in that adding the -w flag (which requires the matches to be delimited by word boundaries) makes [a-z]* return each word separately.

ilkkachu makes the observation that -o with a pattern that allows matching an empty string is in some sense confusing (or at least ambiguous). Should all empty matches be printed? There are in fact infinitely many such matches after each word in the given string.

The OpenBSD source for grep (which exhibits the same behaviour as grep on macOS) contains a check for this case in src/usr.bin/grep/util.c. It basically says that if the pattern matched (r == 0), if we are using -o (oflag), and if the match start offset is the same as the match end offset (pmatch.rm_so == pmatch.rm_eo, i.e. an empty match), then the result of the match is not printed and the matching on this particular line of input ends (return c with c == 1 for "match found").
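
The empty-match point can be checked without -o at all. The sketch below is only an illustration: the input 123 and the variable names are made up for this example. It counts matching lines with -c, which POSIX does specify, on input that contains no letters, so a pattern can only match there by matching the empty string. It then uses the non-empty form of the pattern with -o, which according to the outputs in the question should list every word on both BSD and GNU grep.

```shell
#!/bin/sh
# The input "123" contains no lowercase letters, so a pattern can only
# match this line by matching the empty string.
empty_ok=$(printf '123\n' | grep -c '[a-z]*' || true)        # zero or more letters
nonempty=$(printf '123\n' | grep -c '[a-z][a-z]*' || true)   # at least one letter
# ("|| true" only hides grep's exit status 1 when nothing matches.)

echo "[a-z]* matching lines:      $empty_ok"
echo "[a-z][a-z]* matching lines: $nonempty"

# A pattern that cannot match the empty string avoids the ambiguity:
# there is no empty match for -o to discard, stop at, or loop on.
# Note: -o itself is an extension, not specified by POSIX.
words=$(printf 'once upon a time\n' | grep -o '[a-z][a-z]*')
echo "$words"
```

The first count should come out as 1 and the second as 0, which shows that [a-z]* succeeds on a line purely through an empty match. Writing the pattern as [a-z][a-z]* (or [a-z]+ with -E) is therefore the safer choice whenever the output of -o needs to be the same across implementations.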