How to access man pages as structured content

Tags: command-line, documentation, groff, man, roff

I'm building a resource that references man pages, and I'm wondering whether anyone knows of a way to access man pages as structured data. My current approach relies on a lot of regex matching, which is tedious and error-prone.

I'm not an expert on *nix, but my understanding is that man pages are basically text files written in a particular syntax that the man command can parse. This makes me doubt that there is an easy way to, say, get a list of the options or flags. But maybe there's a way to do it that I don't know about.

Best Answer

You might peek at how the fish shell builds its completions from the man pages, in particular how __fish_complete_man works. An easier option, assuming groff is available, might be to emit HTML and then use one of the multitude of HTML parsers out there to get what you want:

$ groff -T html -mdoc xpquery.1 | xpquery -p HTML '//p[b="xpquery"][2]' -
<p style="margin-left:17%;"><b>xpquery</b>
[<b>−E </b><i>encoding</i>]
[<b>−n </b><i>namespace</i>]
[<b>−p </b><i>method</i>]
[<b>−S </b><i>xpath-subquery</i>]
[<b>−t </b><i>timeout</i>] <i>xpath-query
file-or-url ..</i></p>
$ 

That's a man page rendered as HTML, with an XPath query then used to select the list of flags from the SYNOPSIS section; CSS selectors might be more hip these days. However, the generated HTML is not very structured: as the output shows, flags and their arguments are marked only by presentational <b> and <i> tags.
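If you would rather not depend on an XPath tool, the same idea can be sketched with nothing but Python's standard library. This is only a rough illustration, not a general man page parser: it assumes the flags appear inside <b> elements and start with a dash, as in the output above. The names BoldFlagCollector, extract_flags and SAMPLE are made up for this example, and SAMPLE is just the snippet from the transcript pasted in as a string.

```python
from html.parser import HTMLParser


class BoldFlagCollector(HTMLParser):
    """Collect the text of <b> elements that look like option flags."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # how many <b> elements we are currently inside
        self._buf = []    # text pieces seen inside the current <b>
        self.flags = []   # flags found so far, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buf).strip()
                self._buf.clear()
                # The HTML shown above uses U+2212 (minus sign) for the
                # option dash; normalise it to an ASCII hyphen.
                text = text.replace("\u2212", "-")
                # Assumption: anything bold that starts with a dash is a flag.
                if text.startswith("-"):
                    self.flags.append(text)

    def handle_data(self, data):
        if self._depth > 0:
            self._buf.append(data)


def extract_flags(html):
    """Return the dash-prefixed bold strings found in an HTML fragment."""
    parser = BoldFlagCollector()
    parser.feed(html)
    parser.close()
    return parser.flags


# Hypothetical input: the SYNOPSIS paragraph from the transcript above.
SAMPLE = """<p style="margin-left:17%;"><b>xpquery</b>
[<b>\u2212E </b><i>encoding</i>]
[<b>\u2212n </b><i>namespace</i>]
[<b>\u2212p </b><i>method</i>]
[<b>\u2212S </b><i>xpath-subquery</i>]
[<b>\u2212t </b><i>timeout</i>] <i>xpath-query
file-or-url ..</i></p>"""

if __name__ == "__main__":
    print(extract_flags(SAMPLE))
```

In practice you would feed it the output of the groff command instead of SAMPLE. Because it relies on bold text rather than real structure, expect false positives and misses on pages that format their options differently.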
