Grep Man Page – Unexpected Results When Searching for Words in Headings

Tags: grep, man, special-characters

I'm running into weird behavior when trying to grep a man page on macOS. For example, the Bash man page clearly has an occurrence of the string NAME:

$ man bash | head -5 | tail -1
NAME

And if I grep for name I do get results, but if I grep for NAME I don't:

$ man bash | grep 'NAME'
$ man bash | grep NAME

I've tried other uppercase words that I know are in there, and searching for SHELL yields nothing whereas searching for BASH yields results.

What's going on here?

Update: Thanks for all the answers! I thought it worth adding the context in which I ran into this. I wanted to write a bash function that wraps man and, when I look up the man page for a shell builtin, jumps to the relevant section of the Bash man page. There might be a better way, but here's what I've got currently:

man () {
  case "$(type -t "$1")" in
    builtin)
      local pattern="^ *$1"

      if bashdoc_match "$pattern \+[-[]"; then
        command man bash | less --pattern="$pattern +[-[]"
      elif bashdoc_match "$pattern\b"; then
        command man bash | less --pattern="$pattern[[:>:]]"
      else
        command man bash
      fi
      ;;
    keyword)
      command man bash | less --hilite-search --pattern='^SHELL GRAMMAR$'
      ;;
    *)
      command man "$@"
      ;;
  esac
}

bashdoc_match() {
  command man bash | col -b | grep -q "$1"
}

Best Answer

If you add a | sed -n l to that tail command, to show non-printable characters, you'll probably see something like:

N\bNA\bAM\bME\bE

That is, each character is written as X Backspace X. On modern terminals, the character simply ends up being written over itself, with no visible difference, since Backspace (aka BS aka \b aka ^H) is the character that moves the cursor one column to the left. On ancient teletypewriters, though, it would cause the character to appear in bold, as it gets twice as much ink.

Still, pagers like more/less do understand that format to mean bold, so that's still what roff does to output bold text.
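To see why grep finds nothing, the overstruck form can be reproduced by hand. This is only a sketch: overstrike is a hypothetical helper, not something man or roff provides, and it assumes plain ASCII input.

```shell
# Hypothetical helper: rewrite each character X of its argument as X BS X,
# the way nroff-style output marks bold text.
overstrike() {
  printf '%s\n' "$1" | sed "s/./&$(printf '\b')&/g"
}

# The bytes are now N BS N A BS A M BS M E BS E, so the literal
# string NAME never appears and grep reports no match.
if overstrike NAME | grep -q NAME; then
  echo "match"
else
  echo "no match"
fi
```

On a terminal, the output of overstrike NAME still looks like NAME, which is why the missing match is so surprising.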

Some man implementations call roff in a way that avoids those sequences, or strip them internally. The man-db implementation, for instance, runs the equivalent of col -b -p -x unless the MAN_KEEP_FORMATTING environment variable is set. Those implementations also don't invoke a pager when they detect that the output is not going to a terminal, so man bash | grep NAME works there. Yours does neither.

You can use col -b to remove those sequences (there are other types as well, such as _ BS X for underline).
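As a rough illustration of what stripping those sequences involves, here is a sed-based stand-in. It is not how col is implemented, and strip_overstrike is a hypothetical name. It assumes every backspace follows the character it overstrikes, which covers both the X BS X and _ BS X forms.

```shell
# Hypothetical helper: drop every "character + backspace" pair, keeping
# only the final character written in each column.
strip_overstrike() {
  sed "s/.$(printf '\b')//g"
}

# Bold NAME as nroff-style output would encode it; after stripping,
# the line is plain text and grep finds it.
printf 'N\bNA\bAM\bME\bE\n' | strip_overstrike | grep NAME
```

In practice, man bash | col -b | grep NAME is the simpler route wherever col is installed.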

For systems using GNU roff (like GNU or FreeBSD), you can avoid those sequences being used in the first place by making sure the -c -b -u options are passed to grotty, for instance by making sure the -P-cbu option is passed to groff.

One way to do that is to create a wrapper script called groff containing:

#! /bin/sh -
exec /usr/bin/groff -P-cbu "$@"

and put it ahead of /usr/bin/groff in $PATH.

With macOS' man (also using GNU roff), you can create a man-no-overstrike.conf with:

NROFF /usr/bin/groff -mandoc -Tutf8 -P-cbu

And call man as:

man -C man-no-overstrike.conf bash | grep NAME

Still with GNU roff, if you set the GROFF_SGR environment variable (or don't set the GROFF_NO_SGR variable, depending on how the defaults were set at compile time), then grotty will use ANSI SGR terminal escape sequences instead of those BS tricks for character attributes, as long as it's not passed the -c option. less understands them when called with the -R option.

FreeBSD's man calls grotty with the -c option unless you're asking for colours by setting the MANCOLOR variable (in which case -c is not passed to grotty and grotty reverts to the default of using ANSI SGR escape sequences there).

MANCOLOR=1 man bash | grep NAME

will work there.

On Debian, GROFF_SGR is not the default. If you do:

GROFF_SGR=1 man bash | grep NAME

however, because man's stdout is not a terminal, man takes it upon itself to also pass a GROFF_NO_SGR variable to grotty, which overrides our GROFF_SGR. (I suppose that is so it can use col -bpx to strip the BS sequences, as col doesn't know how to strip SGR sequences, even though it still does so with MAN_KEEP_FORMATTING.) You can do instead:

GROFF_SGR=1 MANPAGER='grep NAME' man bash

(in a terminal) to have the SGR escape sequences.

This time, you'll notice that some of those NAMEs do appear in bold on the terminal (and in a less -R pager). If you feed the output to sed -n l (MANPAGER='sed -n /NAME/l'), you'll see something like:

\033[1mNAME\033[0m$

Where \033[1m (ESC [ 1 m) is the sequence to enable bold in ANSI-compatible terminals, and \033[0m the sequence to revert all SGR attributes to the default.

On that text, grep NAME works because the text does contain NAME, but you could still have problems when looking for text of which only part is in bold or underline...
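For that partially formatted case, one possible workaround is to delete the SGR sequences before searching. The sketch below is an assumption on my part, not something man or groff does for you. strip_sgr is a hypothetical helper, and it only handles sequences of the form ESC [ digits-and-semicolons m.

```shell
# Hypothetical helper: delete SGR sequences of the form ESC [ ... m.
strip_sgr() {
  sed "s/$(printf '\033')\[[0-9;]*m//g"
}

# NAME with only its first two letters in bold: a plain grep NAME would
# miss it, but it matches once the escape sequences are removed.
printf '\033[1mNA\033[0mME\n' | strip_sgr | grep NAME
```

Other escape sequences that a terminal would interpret are left untouched by this pattern.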
