Confusion about difference in GNU/macOS grep output when using -o

grep, regular expression

Why does BSD grep on macOS produce only the first word here:

$ echo "once upon a time" | grep -o "[a-z]*"
once

but all words here:

$ echo "once upon a time" | grep -o "[a-z][a-z]*"
once
upon
a
time

Or, using extended regular expressions:

$ echo "once upon a time" | grep -E -o "[a-z]*"
once

$ echo "once upon a time" | grep -E -o "[a-z]+"
once
upon
a
time

GNU grep will produce identical output for both [a-z]+ (or [a-z][a-z]*) and [a-z]*:

$ echo "once upon a time" | ggrep -E -o "[a-z]*"
once
upon
a
time

$ echo "once upon a time" | ggrep -E -o "[a-z]+"
once
upon
a
time

Best Answer

Collecting the thoughts from the comment section, this seems to come down to how different grep implementations have chosen to deal with empty matches: the expression [a-z]* can also match the empty string.
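A quick way to see that [a-z]* matches the empty string is to count matching lines for input that contains no lowercase letters at all. This is only a small sketch; the helper name count_lines_matching and the sample input 12345 are made up for illustration.

```shell
# Hypothetical helper: count the input lines that match the pattern in $1.
# grep exits with status 1 when nothing matches, so "|| true" keeps the
# helper from looking like a failure in scripts that use "set -e".
count_lines_matching() {
    grep -c -- "$1" || true
}

# No lowercase letters anywhere, yet the line still counts as a match,
# because [a-z]* is satisfied by the empty string.
printf '%s\n' "12345" | count_lines_matching "[a-z]*"        # prints 1

# Requiring at least one letter rules out the empty match.
printf '%s\n' "12345" | count_lines_matching "[a-z][a-z]*"   # prints 0
```

Without -o the two grep implementations agree here; the difference only shows up once -o has to decide what to print for a zero-length match.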

The -o option is not defined by POSIX, so how an implementation deals with it is left to the developers.

GNU grep evidently discards empty matches, such as the match of the empty string after once when using [a-z]*, and continues processing the input from the next character onwards.

BSD grep seems to hit the empty match, decide that, for whatever reason, that's enough, and stop there.

Stéphane mentions that the ast-open version of grep actually goes into an infinite loop at the empty match of [a-z]* after once and doesn't get past that point in the string.

OpenBSD grep seems to be different from macOS and FreeBSD grep in that adding the -w flag (which requires the matches to be delimited by word boundaries) makes [a-z]* return each word separately.

ilkkachu makes the observation that -o with a pattern that can match the empty string is in some sense confusing (or at least ambiguous). Should all empty matches be printed? There are in fact infinitely many such matches after each word in the given string.

The OpenBSD source for grep (which exhibits the same behaviour as grep on macOS) contains (src/usr.bin/grep/util.c):

                if (r == 0) {
                        c = 1;
                        if (oflag && pmatch.rm_so != pmatch.rm_eo)
                                goto print;
                        break;
                }
        }
        if (oflag)
                return c;
print:

This basically says: if the pattern matched (r == 0) and we are using -o (oflag), the match is printed only when it is non-empty (pmatch.rm_so != pmatch.rm_eo). If the match start offset is the same as the match end offset (i.e. an empty match), the result of the match is not printed and the matching on this particular line of input ends (return c with c == 1 for "match found").
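The two ways of handling an empty match can be modelled with a short awk loop. This is a rough sketch, not the code of any grep implementation: the function name only_matching_sketch and the strategy names stop and skip are invented here, and the pattern [a-z]* is hard-coded to keep the example small.

```shell
# Hypothetical model of "grep -o '[a-z]*'" with two empty-match strategies:
#   stop - give up on the line at the first empty match (the BSD-like
#          behaviour described above)
#   skip - drop one character and keep searching (the GNU-like behaviour)
only_matching_sketch() {
    awk -v strategy="$1" '{
        s = $0
        # [a-z]* always matches a non-empty string at position 1,
        # possibly with length 0.
        while (s != "" && match(s, /[a-z]*/)) {
            if (RLENGTH > 0) {
                # Non-empty match: print it and continue after it.
                print substr(s, RSTART, RLENGTH)
                s = substr(s, RSTART + RLENGTH)
            } else if (strategy == "stop") {
                # Empty match: stop processing this line.
                break
            } else {
                # Empty match: skip one character and try again.
                s = substr(s, 2)
            }
        }
    }'
}

echo "once upon a time" | only_matching_sketch stop   # prints: once
echo "once upon a time" | only_matching_sketch skip   # prints the four words, one per line
```

The only difference between the two runs is what happens at the zero-length match right after once, which is the same point at which the outputs in the question diverge.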

Related Solutions

Grep caret appears to have no effect

To match whitespace, you have to use [:space:] inside another pair of brackets, which will look like [[:space:]]. You probably meant to write grep -E '^[[:space:]]*h'.

To explain why your current one fails:

As it stands, [:space:]*h is a bracket expression matching any of the characters :, s, p, a, c, and e, repeated any number of times (including zero), followed by h. This matches your string just fine, but if you run grep -o, you'll find that you've only matched the h, not the space.

If you add a caret to the beginning, either one of those characters or h must be at the beginning of the string to match, but none are, so it does not match.
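The corrected pattern can be checked with -o to see exactly what was matched. This is a small sketch; the helper name show_leading_match and the sample line (three spaces followed by hello) are made up for illustration, and tr is only used to make the matched spaces visible.

```shell
# Hypothetical helper: print what '^[[:space:]]*h' matched on each input
# line, with spaces shown as underscores so they are easy to see.
show_leading_match() {
    grep -o -E '^[[:space:]]*h' | tr ' ' '_'
}

# Three leading spaces followed by "hello": the whitespace is now part of
# the match, and the caret anchors it to the start of the line.
printf '%s\n' '   hello' | show_leading_match   # prints ___h
```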

GNU Parallel – grepping n lines for m regular expressions

It is due to GNU Parallel --pipe being slow.

cat bigfile | parallel --pipe -L1000 --round-robin grep -f regexp.txt -

maxes out at around 100 MB/s.

In the man page example you will also find:

parallel --pipepart --block 100M -a bigfile grep -f regexp.txt

which does nearly the same thing, but maxes out at 20 GB/s on a 64-core system.

parallel --pipepart --block 100M -a bigfile -k grep -f regexp.txt

should give exactly the same result as grep -f regexp.txt bigfile
