How to start multi-threaded grep in terminal

grep parallelism

I have a folder which has 250+ files of 2 GB each. I need to search for a string/pattern in those files and output the result in an output file. I know I can run the following command, but it is too slow!!

grep mypattern * > output

I want to speed it up. Being a programmer in Java, I know multi-threading can be used for speeding up the process. I'm stuck on how to start grep in "multi-threaded mode" and write the output into a single output file.

Best Answer

There are two easy solutions for this: xargs or parallel.

xargs Approach:

You can use xargs with find as follows:

find . -type f -print0  | xargs -0 -P number_of_processes grep mypattern > output

Here, replace number_of_processes with the maximum number of processes you want to launch. However, this is not guaranteed to give a significant speedup if your workload is I/O limited. In that case you might try starting more processes to compensate for the time lost waiting for I/O.
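As a rough sketch of how the process count might be chosen, the snippet below derives it from the number of online cores. The sample files, the variable names, and the factor of 2 are illustrative assumptions, and getconf _NPROCESSORS_ONLN is assumed to be available (it is on Linux and macOS). The output file is written outside the searched directory so that find does not hand it back to grep.

```shell
# Hypothetical sample data so the sketch runs anywhere.
dir=$(mktemp -d)
out=$(mktemp)
printf 'first mypattern\n' > "$dir/one.txt"
printf 'second mypattern\n' > "$dir/two.txt"

# Number of online cores (assumes getconf supports _NPROCESSORS_ONLN).
cores=$(getconf _NPROCESSORS_ONLN)

# For I/O-limited searches, twice the core count is an arbitrary
# starting point to experiment from, not a rule.
procs=$((cores * 2))

# -H makes grep print the file name even if it is given a single file.
find "$dir" -type f -print0 | xargs -0 -P "$procs" grep -H mypattern > "$out"

cat "$out"
```

Timing a few different values of procs on the real data is probably the only reliable way to find a good setting.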

Also, because find is part of the pipeline, you can select files by more advanced criteria than name patterns alone, such as modification time.
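For instance, a search could be restricted to recently modified log files. This is only a sketch: the *.log pattern, the 7-day window, and the sample files are assumptions made for illustration.

```shell
# Hypothetical sample data so the sketch runs anywhere.
dir=$(mktemp -d)
out=$(mktemp)
printf 'mypattern in a log\n' > "$dir/a.log"
printf 'mypattern in a text file\n' > "$dir/b.txt"

# Search only regular *.log files modified within the last 7 days.
# b.txt also contains the pattern but is filtered out by find.
find "$dir" -type f -name '*.log' -mtime -7 -print0 |
  xargs -0 -P 4 grep -H mypattern > "$out"

cat "$out"
```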

One possible issue with this approach, as explained in Stéphane's comments, is that if there are only a few files, xargs may not start enough processes for them. One solution is to use the -n option of xargs to specify how many arguments it should take from the pipe at a time. Setting -n1 forces xargs to start a new process for every single file. This may be the desired behavior if the files are very large (as in this question) and relatively few in number. However, if the files themselves are small, the overhead of starting a new process may cancel out the advantage of parallelism, in which case a larger -n value is better. The -n option can thus be fine-tuned according to the size and number of the files.
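The two ends of that trade-off might look as follows. The sample files, the limit of 4 processes, and the batch size of 100 are illustrative values, not recommendations.

```shell
# Hypothetical sample data so the sketch runs anywhere.
dir=$(mktemp -d)
out=$(mktemp)
for i in 1 2 3 4 5 6; do
  printf 'mypattern from file %s\nunrelated line\n' "$i" > "$dir/file$i.txt"
done

# Few large files: -n 1 starts one grep per file, at most 4 at a time.
find "$dir" -type f -print0 | xargs -0 -n 1 -P 4 grep -H mypattern > "$out"

# Many small files: give each grep up to 100 files to amortize startup cost.
# find "$dir" -type f -print0 | xargs -0 -n 100 -P 4 grep -H mypattern > "$out"

# The order of lines depends on scheduling, so sort them for display.
sort "$out"
```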

Parallel Approach:

Another way to do it is to use Ole Tange's GNU Parallel tool, parallel. It offers finer-grained control over parallelism and can even distribute jobs over multiple hosts (which would be beneficial if, for example, your directory is shared). The simplest syntax using parallel is:

find . -type f | parallel -j+1 grep mypattern

where the option -j+1 instructs parallel to start one process more than the number of cores on your machine. This can be helpful for I/O-limited tasks; you may even try going higher.

Parallel also has the advantage over xargs that it groups the output of each process, so lines from different processes are not interleaved. For example, with xargs, if process 1 generates a line, say p1L1, then process 2 generates a line p2L1, and then process 1 generates another line p1L2, the output will be:

p1L1
p2L1
p1L2

whereas with parallel the output should be:

p1L1
p1L2
p2L1

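This is usually more useful than xargs output. Note that parallel prints each process's block of output when that process finishes; add -k (--keep-order) if the blocks should also follow the order of the input.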
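Putting these pieces together, a fuller invocation might look like the sketch below. It assumes GNU Parallel is installed (the guard skips the run otherwise, since the unrelated moreutils parallel takes different options), and the sample files and variable names are hypothetical.

```shell
# Hypothetical sample data so the sketch runs anywhere.
dir=$(mktemp -d)
out=$(mktemp)
printf 'p1L1 mypattern\nnoise\np1L2 mypattern\n' > "$dir/file1.txt"
printf 'p2L1 mypattern\n' > "$dir/file2.txt"

# Guard: run only if the installed parallel is GNU Parallel.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # -0   read NUL-delimited file names (safe for unusual names)
  # -k   print each job's output in the order the inputs arrived
  # -j+1 run one job more than the number of cores
  # -H   make grep print the file name, since each job gets one file
  find "$dir" -type f -print0 |
    parallel -0 -k -j+1 grep -H mypattern {} > "$out"
  cat "$out"
else
  echo "GNU Parallel not found; skipping" >&2
fi
```

Because each job's output is grouped, the two matching lines from file1.txt stay next to each other in the output file.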
