This was a bug in bsdgrep, relating to a variable that tracks the part of the current line still to scan. That variable is overwritten by successive calls to the regular expression matching engine when multiple patterns are involved.
local fix
You can work around this to an extent by not using the -w option, which relies upon this variable for correct operation and is therefore failing. Instead, use the regular expression extensions that match the beginnings and ends of words, making your stopwords file look like:
\<i\>
\<file\>
\<types\>
This workaround also requires that you do not use the -F option, because -F makes grep treat the patterns as fixed strings rather than regular expressions.
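As a rough sketch of this workaround, the word-bounded pattern file could be generated from a plain word list instead of being written by hand. The file names (words.txt, input.txt), the temporary directory, and the sample input lines are assumptions made for illustration and do not come from the original question.

```shell
#!/bin/sh
# Hypothetical file names; the word list mirrors the stopwords example above.
workdir=$(mktemp -d)
printf '%s\n' i file types > "$workdir/words.txt"

# Wrap each word in \< and \> so that it only matches as a whole word.
# Caveat: words containing regex metacharacters would need escaping first.
sed 's/.*/\\<&\\>/' "$workdir/words.txt" > "$workdir/stopwords"

# Illustrative input, not taken from the original question.
printf '%s\n' 'i like files' 'profile types' 'nothing here' > "$workdir/input.txt"

# No -w and no -F: the word boundaries come from the patterns themselves.
matches=$(grep -f "$workdir/stopwords" "$workdir/input.txt")
printf '%s\n' "$matches"
```

The sed step only adds the boundary markers, so the original word list can stay as the file you maintain and the pattern file can be regenerated whenever it changes.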
Note that the documented regular expression components [[:<:]] and [[:>:]] that the re_format manual tells you about will not work here. This is because the regular expression library that is compiled into bsdgrep has GNU regular expression compatibility support turned on. This is another bug, which is reportedly fixed.
service fix
This bug was fixed earlier this year. The fix has not yet made it into the STABLE or RELEASE flavours of FreeBSD, but is reportedly in CURRENT.
For getting this into the macOS version of grep, which is derived from FreeBSD's bsdgrep, please consult Apple. ☺
Further reading
The basic question is
My primary question is, is this a bug in MacOS? Or is Linux wrong in insisting that the variable needs to be set to a fully specified locale name?
and the POSIX page for environment variables shows the reason why others view the macOS configuration as incorrect:
[XSI] If the locale value has the form:
language[_territory][.codeset]
it refers to an implementation-provided locale, where settings of language, territory, and codeset are implementation-defined.
LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME are defined to accept an additional field @ modifier, which allows the user to select a specific instance of localization data within a single category (for example, for selecting the dictionary as opposed to the character ordering of data). The syntax for these environment variables is thus defined as:
[language[_territory][.codeset][@modifier]]
For example, if a user wanted to interact with the system in French, but required to sort German text files, LANG and LC_COLLATE could be defined as:
LANG=Fr_FR
LC_COLLATE=De_DE
This could be extended to select dictionary collation (say) by use of the @ modifier field; for example:
LC_COLLATE=De_DE@dict
An implementation may support other formats.
If the locale value is not recognized by the implementation, the behavior is unspecified.
That is, they assume that POSIX prescribes a syntax for the locale settings.
An unwary reader would assume that POSIX defines the permissible forms for the environment variables so that the codeset value is optional and cannot act as a replacement for the language. But that last "may" opens up a can of worms, in effect blessing this difference in interpretation. Apple can do whatever it wants if it chooses to provide valid locales which don't follow that pattern exactly.
@tripleee suggested that the page on Locale gives better information, but that is almost entirely a discussion of the locale definitions rather than providing guidance for interoperability (i.e., POSIX's ostensible goal).
Neither page addresses differences in the available locale settings (such as ".utf8" versus ".UTF-8"). Those are implementation-dependent, as noted on the POSIX page. That leaves users with a single option: determine for themselves what locale settings are supported on the local and remote systems, and (ssh behavior here) determine how to set those on the remote system "compatibly".
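One way to do that checking is sketched below. It reads a list of locale names in the one-per-line form that locale -a prints and looks for an entry equivalent to the one you want. The helper names normalize_locale and find_locale are hypothetical, and treating names as equal once case and hyphens are ignored is an assumption made for this sketch; whether two spellings really name the same locale is up to each implementation.

```shell
#!/bin/sh
# Hypothetical helper: lowercase the name and drop hyphens, so that
# en_US.UTF-8 and en_US.utf8 compare equal. This is a heuristic, not a
# rule that POSIX defines.
normalize_locale() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/-//g'
}

# Hypothetical helper: read locale names on stdin (one per line, as
# `locale -a` prints them) and print the first one that matches $1 after
# normalization. Returns non-zero if nothing matches.
find_locale() {
    wanted=$(normalize_locale "$1")
    while IFS= read -r candidate; do
        if [ "$(normalize_locale "$candidate")" = "$wanted" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Possible usage (remotehost is a placeholder):
#   locale -a | find_locale en_US.UTF-8
#   ssh remotehost locale -a | find_locale en_US.UTF-8
```

Running it against both the local list and the remote list shows which spelling each side actually provides, which is the name you would then need to set on that side.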
Best Answer
I think this might be a bug in FreeBSD's grep. There's a bug report with similar issues.