I normally use this style of command to run grep
over a number of files:
find / -xdev -type f -print0 | xargs -0 grep -H "800x600"
What this actually does is make a list of every file on the system, and then run grep with the given arguments on each of those files.
The -xdev argument tells find that it must ignore other filesystems - this is good for avoiding special filesystems such as /proc. However it will also ignore normal filesystems too - so if, for example, your /home folder is on a different partition, it won't be searched - you would need to say find / /home -xdev ....
-type f means search for files only, so directories, devices and other special files are ignored (it will still recurse into directories and execute grep on the files within - it just won't execute grep on the directory itself, which wouldn't work anyway). And the -H option to grep tells it to always print the filename in its output.
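To see what -H actually changes: grep only prints the filename by default when it is given more than one file, so with a single file (as xargs may sometimes pass) the filename would be missing. A quick demonstration, using a throwaway temp file:

```shell
# Create a temp file containing the pattern we search for.
f=$(mktemp)
printf '800x600\n' > "$f"

# Without -H, a single-file grep prints only the matching line.
grep "800x600" "$f"

# With -H, the filename is always prefixed, even for one file.
grep -H "800x600" "$f"
```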
find accepts all sorts of options to filter the list of files. For example, -name '*.txt' processes only files ending in .txt, -size -2M means files that are smaller than 2 megabytes, and -mtime -5 means files modified in the last five days. Join these together with -a for "and" and -o for "or", and use '(' parentheses ')' to group expressions (in quotes to prevent the shell from interpreting them). So for example:
find / -xdev '(' -type f -a -name '*.txt' -a -size -2M -a -mtime -5 ')' -print0 | xargs -0 grep -H "800x600"
Take a look at man find to see the full list of possible filters.
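As a side note, find can also run grep itself without xargs, using the -exec ... {} + form, which batches filenames onto grep's command line much like xargs -0 does. A small sketch, demonstrated on a sandbox directory rather than / (the real invocation would use the same starting points and filters as above):

```shell
# Build a small sandbox; in real use the starting point would be /
# (or / /home, etc.) as in the pipeline above.
dir=$(mktemp -d)
printf '800x600\n' > "$dir/match.txt"
printf 'nothing\n' > "$dir/other.txt"

# -exec ... {} + hands filenames to grep in batches, like xargs -0,
# and is equally safe with unusual filenames.
find "$dir" -xdev -type f -exec grep -H "800x600" {} +
```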
docx2txt works on the information in the docx file, which is a zipped set of XML files.
With regard to line wrapping, the .docx XML data only includes information about paragraphs and hard breaks, not about soft breaks. Soft breaks are a result of rendering the text in a specific font, font size and page width. docx2txt normally just tries to fit text in 80 columns (the column width is configurable), without any regard for font and font size. If your .docx contains font information from a Windows system that is not available on Unix/Linux, then doing the export to .txt via Open/LibreOffice would also be unlikely to result in the same layout, although it tries to do a good job¹.
So docx2txt, or any other command-line utility, including command-line driven Open/LibreOffice processing, is not guaranteed to convert the text to the same layout as exporting from Word does².
If you want to (or are forced by client requirements to) render exactly as Word does, there is in my experience only one way: let Word do the rendering. When faced with a problem similar to yours³, and getting incompatible results from other tools, including OpenOffice, I resorted to installing a Windows VM on the host Linux server. On the client VM a program watches for incoming files to be converted on the host, starts and drives Word to do the conversion, and then copies back the result⁴.
Decisions about using CR/LF or LF only, or UTF-8 or some other encoding, for the .txt largely depend on how the resulting files are used. If the resulting files are used on Windows I would definitely go with CR/LF, UTF-8 and a UTF-8 BOM. Modern programs on Linux are able to deduce that a file is UTF-8, and will not barf on the BOM (and may even use that information). You should test all your target applications for compatibility if those are known up front.
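If you do need to post-process a LF-only, UTF-8 file for Windows consumers, a BOM and CR/LF endings can be added with standard tools. A minimal sketch, using placeholder names in.txt and out.txt (unix2dos would also work where installed):

```shell
# Work in a scratch directory with a sample LF-only input file.
dir=$(mktemp -d)
cd "$dir"
printf 'line one\nline two\n' > in.txt

# Write the UTF-8 BOM (bytes EF BB BF), then append the text with
# every LF line ending replaced by CR/LF.
printf '\357\273\277' > out.txt
awk '{printf "%s\r\n", $0}' in.txt >> out.txt
```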
¹ This sort of incompatibility is the primary reason some of my friends cannot change to Linux from Windows, although they would like to. They have to use Microsoft Word, as Open/LibreOffice every once in a while mangles texts they exchange with clients.
² You can install all the fonts used in the Word files and might get lucky for some texts, some of the time.
³ Rendering PDFs from .doc/.docx
⁴ The program uses GUI automation - as if someone is clicking its menus - and doesn't attempt to drive Word via an API. I am pretty sure the latter can be done as well, and it would have the advantage of not breaking things if Word were upgraded.
I know of several indexing tools that support Word documents. Such tools allow you to index documents once and then efficiently search for words in the index; they don't permit full-text searches.