I normally use this style of command to run grep
over a number of files:
find / -xdev -type f -print0 | xargs -0 grep -H "800x600"
What this actually does is make a list of every file on the system, and then run grep with the given arguments on each of those files.
The -xdev argument tells find that it must ignore other filesystems - this is good for avoiding special filesystems such as /proc. However it will also ignore normal filesystems too - so if, for example, your /home folder is on a different partition, it won't be searched - you would need to say find / /home -xdev ....
-type f means search for files only, so directories, devices and other special files are ignored (it will still recurse into directories and execute grep on the files within - it just won't execute grep on the directory itself, which wouldn't work anyway). And the -H option to grep tells it to always print the filename in its output.
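To see what -H actually changes: grep only prints the filename by default when it is given more than one file, so with a single file (as xargs may sometimes pass) the filename would be missing. A quick demonstration, using a throwaway temp file:

```shell
# Create a temp file containing the pattern we search for.
f=$(mktemp)
printf '800x600\n' > "$f"

# Without -H, a single-file grep prints only the matching line.
grep "800x600" "$f"

# With -H, the filename is always prefixed, even for one file.
grep -H "800x600" "$f"
```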
find accepts all sorts of options to filter the list of files. For example, -name '*.txt' processes only files ending in .txt, -size -2M means files that are smaller than 2 megabytes, and -mtime -5 means files modified in the last five days. Join these together with -a for "and" and -o for "or", and use '(' parentheses ')' to group expressions (in quotes to prevent the shell from interpreting them). So for example:
find / -xdev '(' -type f -a -name '*.txt' -a -size -2M -a -mtime -5 ')' -print0 | xargs -0 grep -H "800x600"
Take a look at man find to see the full list of possible filters.
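As a side note, find can also run grep itself without xargs, using the -exec ... {} + form, which batches filenames onto grep's command line much like xargs -0 does. A small sketch, demonstrated on a sandbox directory rather than / (the real invocation would use the same starting points and filters as above):

```shell
# Build a small sandbox; in real use the starting point would be /
# (or / /home, etc.) as in the pipeline above.
dir=$(mktemp -d)
printf '800x600\n' > "$dir/match.txt"
printf 'nothing\n' > "$dir/other.txt"

# -exec ... {} + hands filenames to grep in batches, like xargs -0,
# and is equally safe with unusual filenames.
find "$dir" -xdev -type f -exec grep -H "800x600" {} +
```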
docx2txt works on the information in the docx file, which is a zipped set of XML files.
With regard to line wrapping, the .docx XML data only includes information about paragraphs and hard breaks, not about soft breaks. Soft breaks are a result of rendering the text in a specific font, font size and page width. docx2txt normally just tries to fit text in 80 columns (the column width is configurable), without any regard for font and font size. If your .docx contains font information from a Windows system that is not available on Unix/Linux, then doing the export to .txt via Open/LibreOffice would also be unlikely to result in the same layout, although it tries to do a good job¹.
So docx2txt, or any other command-line utility, including command-line driven Open/LibreOffice processing, is not guaranteed to convert the text to the same layout as exporting from Word does².
If you want to (or are forced by client requirements to) render exactly as Word does, there is in my experience only one way: let Word do the rendering. When faced with a problem similar to yours³, and getting incompatible results from other tools, including OpenOffice, I resorted to installing a Windows VM on the host Linux server. On the client VM a program watches for incoming files to be converted on the host, starts and drives Word to do the conversion, and then copies back the result⁴.
Decisions about using CR/LF or LF only, or UTF-8 or some other encoding, for the .txt largely depend on how the resulting files are used. If the resulting files are used on Windows I would definitely go with CR/LF, UTF-8 and a UTF-8 BOM. Modern programs on Linux are able to deduce that a file is UTF-8, and will not barf on the BOM (and may even use that information). You should test all your target applications for compatibility if those are known up front.
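If you do need to post-process a LF-only, UTF-8 file for Windows consumers, a BOM and CR/LF endings can be added with standard tools. A minimal sketch, using placeholder names in.txt and out.txt (unix2dos would also work where installed):

```shell
# Work in a scratch directory with a sample LF-only input file.
dir=$(mktemp -d)
cd "$dir"
printf 'line one\nline two\n' > in.txt

# Write the UTF-8 BOM (bytes EF BB BF), then append the text with
# every LF line ending replaced by CR/LF.
printf '\357\273\277' > out.txt
awk '{printf "%s\r\n", $0}' in.txt >> out.txt
```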
¹ This sort of incompatibility is the primary reason some of my friends cannot change to Linux from Windows, although they would like to. They have to use Microsoft Word, as Open/LibreOffice every once in a while mangles texts they exchange with clients.
² You can install all the fonts used in the Word files and might get lucky for some texts, some of the time.
³ Rendering PDFs from .doc/.docx
⁴ The program uses GUI automation - as if someone is clicking its menus - and doesn't attempt to drive Word via an API. I am pretty sure the latter can be done as well, and it would have the advantage of not breaking things if Word were upgraded.
I know of several indexing tools that support Word documents. Such tools allow you to index documents once and then efficiently search for words in the index; they don't permit full-text searches.