I have a folder which has 250+ files of 2 GB each. I need to search for a string/pattern in those files and output the result in an output
file. I know I can run the following command, but it is too slow!!
grep mypattern * > output
I want to speed it up. Being a programmer in Java, I know multi-threading can be used for speeding up the process. I'm stuck on how to start grep
in "multi-threaded mode" and write the output into a single output
file.
Best Answer
There are two easy solutions for this: using xargs or parallel.
xargs Approach:
You can use xargs with find as follows:
find . -type f | xargs -P number_of_processes grep mypattern
Replace number_of_processes with the maximum number of processes you want launched. This is not guaranteed to give a significant speedup if the job is I/O bound, in which case you might try starting more processes to compensate for the time lost waiting on I/O.
Also, because find is involved, you can select files with more advanced criteria than a file name pattern, such as modification time.
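A slightly more defensive variant of the find and xargs pipeline might look like the sketch below. It is not part of the original answer: the demo directory, file names and process count are made up for illustration, and it assumes a find/xargs/grep that supports -print0, -0, -P and -H (GNU and BSD tools do, but these options are not in POSIX).

```shell
#!/bin/sh
# Sketch only: parallel grep via find + xargs on throwaway demo files.
demo_dir=$(mktemp -d)        # hypothetical stand-in for the real folder
out_file=$(mktemp)           # kept outside demo_dir so it is not searched
printf 'alpha\nmypattern one\n' > "$demo_dir/a.log"
printf 'no match here\n'        > "$demo_dir/b.log"
printf 'mypattern two\n'        > "$demo_dir/c d.log"   # space in the name

number_of_processes=4        # upper bound on concurrent grep processes

# -print0 / -0 pass file names NUL-separated, so names containing spaces
# or newlines are handled safely.
# -H makes grep print the file name even when it is given a single file.
find "$demo_dir" -type f -print0 |
  xargs -0 -P "$number_of_processes" grep -H mypattern > "$out_file"

# Count the matching lines that ended up in the single output file.
match_count=$(wc -l < "$out_file" | tr -d ' ')
echo "matching lines: $match_count"

rm -rf "$demo_dir" "$out_file"   # clean up the demo files
```

With only three tiny files, xargs will most likely hand all of them to a single grep, so this shows the plumbing rather than any speedup.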
One possible issue with this approach, as explained in Stéphane's comments, is that if there are few files, xargs may not start enough processes for them. One solution is to use the -n option of xargs to specify how many arguments it should take from the pipe at a time. Setting -n1 forces xargs to start a new process for each file. This may be the desired behavior if the files are very large (as in this question) and there are relatively few of them. However, if the files are small, the overhead of starting a new process may outweigh the benefit of parallelism, in which case a larger -n value is better. Thus, the -n option can be tuned to the size and number of the files.
Parallel Approach:
Another way to do it is to use Ole Tange's GNU Parallel tool, parallel. It offers finer-grained control over parallelism and can even distribute the work over multiple hosts (which would be beneficial if your directory is shared, for example). The simplest syntax using parallel is:
find . -type f | parallel -j+1 grep mypattern
where the option -j+1 instructs parallel to start one process more than the number of cores on your machine (this can be helpful for I/O-limited tasks; you may even try a higher number).
Parallel also has the advantage over xargs of retaining the order of the output from each process and generating contiguous output. For example, with xargs, if process 1 generates a line, say p1L1, process 2 generates a line p2L1, and process 1 generates another line p1L2, the output will be:
p1L1
p2L1
p1L2
whereas with parallel the output should be:
p1L1
p1L2
p2L1
This is usually more useful than the xargs output.
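The p1L1/p2L1 example can be reproduced with a small experiment, sketched here under a few assumptions. The job is a made-up stand-in for grep that prints one line, sleeps, then prints a second line. The parallel half only runs if GNU parallel is installed, and the -k (keep order) flag is an extra choice made so the result is reproducible, not something the answer above uses.

```shell
#!/bin/sh
# Sketch: interleaved output (xargs -P) versus grouped output (GNU parallel).

# xargs: both jobs write straight to the same stdout, so lines interleave.
# -n 1 gives each job one argument; the trailing "sh" fills $0 for sh -c.
xargs_out=$(printf '1\n2\n' |
  xargs -n 1 -P 2 sh -c 'echo "p${1}L1"; sleep 1; echo "p${1}L2"' sh)
echo "xargs:"
echo "$xargs_out"

# GNU parallel buffers each job's output and prints it as one block.
# -k additionally prints the blocks in input order.
# The version check skips this half if only a non-GNU "parallel" exists.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  parallel_out=$(parallel -k -j 2 'echo p{}L1; sleep 1; echo p{}L2' ::: 1 2)
  echo "parallel:"
  echo "$parallel_out"
else
  parallel_out=""
  echo "GNU parallel not found; skipping the grouped-output half" >&2
fi
```

On a typical run the xargs half prints both L1 lines before either L2 line, while the parallel half keeps each job's two lines together.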