That would be GROFF.
(Older Unix versions used NROFF or TROFF, but they are all different programs in the same family.)
Please note that it is not a simple command that operates on a text file. It requires formatting instructions to be inserted in the text.
In fact, it operates much more like a compiler.
If you are familiar with TeX or LaTeX then you know the principle, but the various ROFF-style programs have a much simpler command language.
docx2txt works on the information in the docx file, which is a zipped set of XML files.
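You can see this for yourself with nothing but the Python standard library. The sketch below builds a stripped-down stand-in for a .docx (a real one carries more parts, such as [Content_Types].xml, relationships, and styles) and then extracts the paragraph text the way a tool like docx2txt conceptually does; all names and content here are illustrative.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML inside word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# A minimal document.xml: two paragraphs (<w:p>), each holding a run
# (<w:r>) with a text node (<w:t>).
doc_xml = f'''<?xml version="1.0"?>
<w:document xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>
  </w:body>
</w:document>'''

# A .docx is just a zip archive, so zipfile can create and read it.
with zipfile.ZipFile("sample.docx", "w") as z:
    z.writestr("word/document.xml", doc_xml)

with zipfile.ZipFile("sample.docx") as z:
    root = ET.fromstring(z.read("word/document.xml"))

# Join the text runs within each paragraph; paragraphs are the only
# line-structure the XML records.
paragraphs = ["".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
              for p in root.iter(f"{{{W}}}p")]
print("\n".join(paragraphs))
```

Note that nothing in the XML says where a line wraps on the page, which is exactly the point made below about soft breaks.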
With regard to line wrapping, the .docx XML data only includes information about paragraphs and hard breaks, not about soft breaks. Soft breaks are a result of rendering the text in a specific font, font size, and page width. docx2txt normally just tries to fit text in 80 columns (the width is configurable), without any regard for font and font size. If your .docx contains font information from a Windows system that is not available on Unix/Linux, then exporting to .txt via Open/LibreOffice is also unlikely to result in the same layout, although it tries to do a good job¹.
So docx2txt, or any other command-line utility, including command-line-driven Open/LibreOffice processing, is not guaranteed to convert the text to the same layout that exporting from Word produces².
If you want to (or are forced by client requirements to) render exactly as Word does, there is, in my experience, only one way: let Word do the rendering. When faced with a problem similar to yours³, and getting incompatible results from other tools, including OpenOffice, I resorted to installing a Windows VM on the Linux host server. On the VM, a program watches for incoming files to be converted on the host, starts and drives Word to do the conversion, and then copies back the result⁴.
Decisions about using CR/LF or LF only, and UTF-8 or some other encoding, for the .txt files largely depend on how the resulting files are used. If the resulting files are used on Windows I would definitely go with CR/LF, UTF-8, and a UTF-8 BOM. Modern programs on Linux are able to deduce that a file is UTF-8, and will not barf on the BOM (some will even use that information). You should test all your target applications for compatibility if those are known up front.
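If you post-process the converted text yourself, this is straightforward to do from Python's standard library; the filename and content below are just placeholders. The "utf-8-sig" codec prepends the BOM, and newline="\r\n" forces CR/LF regardless of the platform the script runs on.

```python
# Write a .txt the Windows-friendly way: UTF-8 with a BOM and
# CR/LF line endings.
lines = ["first line", "second line with some non-ASCII: é"]

with open("out.txt", "w", encoding="utf-8-sig", newline="\r\n") as f:
    for line in lines:
        f.write(line + "\n")   # each "\n" is translated to "\r\n"

raw = open("out.txt", "rb").read()
print(raw.startswith(b"\xef\xbb\xbf"))  # BOM present
print(b"\r\n" in raw)                   # CR/LF endings
```

Reading such a file back with encoding="utf-8-sig" transparently strips the BOM again, which is handy when testing round trips.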
¹ This sort of incompatibility is the primary reason some of my friends cannot switch from Windows to Linux, although they would like to. They have to use Microsoft Word, as Open/LibreOffice every once in a while mangles texts they exchange with clients.
² You can install all the fonts used in the Word files and might get lucky for some texts, some of the time.
³ Rendering PDFs from .doc/.docx
⁴ The program uses GUI automation—as if someone is clicking its menus—and doesn't attempt to drive Word via an API. I am pretty sure the latter can be done as well, and it would have the advantage of not breaking things if Word were upgraded.
Best Answer
I believe you might be able to use pandoc. It reads from stdin and can output to a variety of formats.
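For example (the filenames here are made up, and this assumes pandoc is installed): -f and -t select the input and output formats, and --wrap/--columns control how the plain-text output is re-wrapped.

```shell
# Convert a .docx to plain text, re-wrapped at 80 columns.
pandoc -f docx -t plain --wrap=auto --columns=80 report.docx -o report.txt

# pandoc also reads from stdin, so it can sit in a pipeline:
printf '**bold** text\n' | pandoc -f markdown -t plain
```

Keep in mind the caveat from the answer above: like docx2txt, pandoc wraps by column count, not by font metrics, so its line breaks will not match Word's rendering.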