Regular Expressions – Handling `<` and `\?` Strange Matches

Tags: grep, regular expression, sed

Assume there is a simple PHP test script:

<?php
$a = ($argc == 2 ? $argv[1] : 10);
for ($i = 0; $i < $a; $i++) {
  echo '.';
}
echo PHP_EOL;

Now, I do a grep or a conditional sed on the file:

grep '<' test.php

yields the two lines containing the <. That's clear.

grep '\?' test.php

yields the two lines containing the question mark. That's clear.

grep '<\?' test.php

returns all lines. Why? I expected it to output only the first line. Perhaps the < should be escaped, but that yields yet another unexpected output.

sed -n '/pattern/p' test.php

yields the same results.

I tried to get an answer at https://regex101.com/, but to my surprise, the website shows what I expect. A quick and dirty PHP implementation of grep also yields what I expect:

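<?php
// Minimal grep: print every line of file $argv[2] that matches pattern $argv[1].
if (($fh = fopen($argv[2], 'r')) !== false) {
  // Compare against false explicitly so a line such as "0" does not end the loop.
  while (($line = fgets($fh)) !== false) {
    if (mb_ereg($argv[1], $line) !== false) echo $line;
  }
  fclose($fh);
}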

My question is: What is the reasoning behind those matches in grep and sed?

Best Answer

grep’s default behaviour is to interpret regular expressions as basic regular expressions (BREs). In a BRE, ? is not a special symbol; it is an ordinary character that matches itself:

grep '<?' test.php

thus gives the result you’re expecting.

As an extension, GNU grep treats the escaped versions of symbols that are special in extended regular expressions (EREs) but not in BREs as special symbols, even in BREs: thus in a BRE, \? has the same meaning as ? in an ERE. So grep '<\?' matches zero or one <. Since "zero <" matches the empty string, and every line contains the empty string, every line matches (and < is highlighted where it occurs if you have colour output enabled).
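To see the difference side by side, here is a minimal sketch that recreates the question's test.php in a temporary directory and counts the matching lines for each pattern flavour. The variable names are made up for illustration, and the `\?` case relies on the GNU extension described above, so other grep implementations may count differently there.

```shell
# Recreate the question's test.php in a temporary directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/test.php" <<'EOF'
<?php
$a = ($argc == 2 ? $argv[1] : 10);
for ($i = 0; $i < $a; $i++) {
  echo '.';
}
echo PHP_EOL;
EOF

# BRE (default): ? is an ordinary character, so this looks for the literal "<?".
bre_literal=$(grep -c '<?' "$tmpdir/test.php")

# ERE (-E): ? means "zero or one of the preceding item", so every line matches.
ere_optional=$(grep -c -E '<?' "$tmpdir/test.php")

# BRE with \?: a GNU extension that behaves like ? in an ERE.
# Other grep implementations are free to treat it differently.
bre_escaped=$(grep -c '<\?' "$tmpdir/test.php")

# Fixed strings (-F): no regex interpretation at all.
fixed=$(grep -c -F '<?' "$tmpdir/test.php")

echo "BRE '<?':  $bre_literal matching line(s)"
echo "ERE '<?':  $ere_optional matching line(s)"
echo "BRE '<\\?': $bre_escaped matching line(s)"
echo "-F  '<?':  $fixed matching line(s)"

rm -rf "$tmpdir"
```

If the goal is simply to find the literal text <?, grep -F sidesteps the BRE/ERE question entirely.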

The same reasoning applies to sed.
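A comparable sketch for sed, assuming a sed that accepts -E for extended regular expressions (GNU and BSD sed both do). The three input lines are a shortened, illustrative sample modelled on the question's script, not the full file.

```shell
# Three sample lines modelled on the question's script (assumed input).
input=$(printf '%s\n' '<?php' 'for ($i = 0; $i < $a; $i++) {' 'echo PHP_EOL;')

# BRE (sed default): ? is literal, so only the line containing "<?" is printed.
bre_out=$(printf '%s\n' "$input" | sed -n '/<?/p')

# ERE (-E): <? means "zero or one <", which matches every line.
ere_out=$(printf '%s\n' "$input" | sed -n -E '/<?/p')

printf 'BRE:\n%s\n\nERE:\n%s\n' "$bre_out" "$ere_out"
```

The BRE run prints only the <?php line, while the ERE run prints all three input lines.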
