How to start multi-threaded grep in terminal

grep parallelism

I have a folder which has 250+ files of 2 GB each. I need to search for a string/pattern in those files and output the result in an output file. I know I can run the following command, but it is too slow!!

grep mypattern * > output

I want to speed it up. Being a programmer in Java, I know multi-threading can be used for speeding up the process. I'm stuck on how to start grep in "multi-threaded mode" and write the output into a single output file.

Best Answer

There are two easy solutions for this: xargs or GNU parallel.

xargs Approach:

You can use xargs with find as follows:

find . -type f -print0  | xargs -0 -P number_of_processes grep mypattern > output

Replace number_of_processes with the maximum number of processes you want to launch. However, this is not guaranteed to give you a significant speedup if the search is I/O-bound, in which case you might try starting more processes to compensate for the time spent waiting on I/O.
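As a rough sketch of how number_of_processes might be chosen, the snippet below uses the number of online cores as a starting point. The scratch directory, the sample files and the fallback value of 4 are assumptions for illustration, and getconf _NPROCESSORS_ONLN works on Linux and macOS but is not guaranteed everywhere.

```shell
# Hypothetical scratch directory with a couple of sample files.
demo_dir=$(mktemp -d)
printf 'mypattern here\nother\n' > "$demo_dir/a.txt"
printf 'other\nmypattern there\n' > "$demo_dir/b.txt"

# Assumption: getconf reports the online cores; otherwise fall back to 4.
procs=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)

# The output file lives outside the searched tree so grep never reads it.
find "$demo_dir" -type f -print0 |
  xargs -0 -P "$procs" grep mypattern > "$demo_dir.out"

cat "$demo_dir.out"
```

For an I/O-bound search it may be worth trying a multiple of that value and timing the result.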

Also, because find selects the files, you can use more advanced criteria than a file name pattern, such as modification time.
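For instance, a search restricted to recently modified log files could look roughly like this. The *.log name, the 7-day window and the sample files are made up for the example.

```shell
# Hypothetical scratch directory: one log file and one unrelated text file.
demo_dir=$(mktemp -d)
printf 'mypattern in log\n' > "$demo_dir/app.log"
printf 'mypattern in text\n' > "$demo_dir/notes.txt"

# Only regular files named *.log that were modified within the last 7 days.
# -H makes grep print the file name even if only one file is selected.
find "$demo_dir" -type f -name '*.log' -mtime -7 -print0 |
  xargs -0 -P 4 grep -H mypattern > "$demo_dir.out"

cat "$demo_dir.out"
```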

One possible issue with this approach, as Stéphane explains in the comments: if there are few files, xargs may not start enough processes for them. One solution is the -n option of xargs, which sets how many arguments it takes from the pipe at a time. Setting -n1 forces xargs to start a new process for every file. This may be desirable if the files are very large (as in this question) and relatively few. However, if the files are small, the overhead of starting a new process may cancel out the benefit of parallelism, in which case a larger -n value works better. The -n option can thus be tuned to the size and number of files.
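A minimal sketch of the -n1 variant, using made-up sample files and an arbitrary limit of four processes. One detail worth noting: when grep receives a single file it does not print the file name, so -H (supported by GNU and BSD grep) is added here.

```shell
# Hypothetical scratch directory with three small sample files.
demo_dir=$(mktemp -d)
printf 'foo\nmypattern one\n' > "$demo_dir/a.txt"
printf 'mypattern two\nbar\n' > "$demo_dir/b.txt"
printf 'nothing here\n' > "$demo_dir/c.txt"

# -n 1 : one file per grep process; -P 4 : at most four processes at once.
# -H   : always print the file name (grep omits it for a single file).
# xargs reports a non-zero status if any grep found no match, so that
# status is ignored here rather than treated as an error.
find "$demo_dir" -type f -print0 |
  xargs -0 -n 1 -P 4 grep -H mypattern > "$demo_dir.out" || true

sort "$demo_dir.out"
```

With a few hundred 2 GB files, one process per file is probably a reasonable default. For many small files a larger value such as -n 50 may amortize the process start-up cost better.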

Parallel Approach:

Another way to do it is to use GNU Parallel (the parallel command) by Ole Tange. It offers finer-grained control over parallelism and can even distribute the work over multiple hosts (which would be beneficial if your directory is shared, for example). The simplest syntax using parallel is:

find . -type f | parallel -j+1 grep mypattern

where the option -j+1 instructs parallel to start one more process than the number of cores on your machine (this can be helpful for I/O-bound tasks; you may even try a higher number).

Parallel also has the advantage over xargs of keeping the output of each process together, so that lines from different processes are not interleaved. For example, with xargs, if process 1 generates a line, say p1L1, process 2 generates a line p2L1, and process 1 then generates another line p1L2, the output will be:

p1L1
p2L1
p1L2

whereas with parallel the output should be:

p1L1
p1L2
p2L1

This is usually more useful than xargs output.
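To write the grouped results into a single file, something along these lines should work, assuming GNU Parallel is installed (the guard skips the run otherwise, and also rejects the unrelated parallel from moreutils). The sample files are hypothetical. The -k option additionally keeps the output in the same order as the input file list.

```shell
# Hypothetical scratch directory with two sample files.
demo_dir=$(mktemp -d)
printf 'mypattern p1L1\nmypattern p1L2\n' > "$demo_dir/p1.txt"
printf 'mypattern p2L1\n' > "$demo_dir/p2.txt"

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # -0 reads NUL-separated names; -k keeps output in input order;
  # -H makes grep print the file name even for a single file.
  # parallel exits non-zero when any job (a grep without a match)
  # fails, so that status is ignored here.
  find "$demo_dir" -type f -print0 |
    parallel -0 -k -j+1 grep -H mypattern > "$demo_dir.out" || true
  cat "$demo_dir.out"
else
  echo "GNU parallel not found; skipping" >&2
fi
```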

Related Solutions

Grep – Pattern Matching with Dashes and Filename Extension Restriction

The problem is that -R tells grep to recursively search through all files in the directory. So, you can't combine it with a specific group of files. Therefore, you can either use find as suggested by @KM., or shell globbing:

$ shopt -s globstar
$ grep -- "->-" **/*.tex

The shopt command activates bash's globstar feature:

globstar
                  If set, the pattern ** used in a pathname expansion context
                  will match all files and zero or more directories and
                  subdirectories. If the pattern is followed by a /, only
                  directories and subdirectories match.

You then give **/*.tex as a pattern and that will match all .tex files in the current directory and any subdirectories.

If you're using zsh, there's no need for the shopt (which is a bash feature anyway) since zsh can do this by default.
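The find alternative mentioned above is not spelled out in this answer. One possible form, shown here on made-up sample files, is the following sketch. It needs neither bash 4 nor zsh.

```shell
# Hypothetical scratch tree: two .tex files and one unrelated file.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/chapters"
printf 'a ->- b\n' > "$demo_dir/main.tex"
printf 'c ->- d\n' > "$demo_dir/chapters/one.tex"
printf 'e ->- f\n' > "$demo_dir/notes.txt"

# -name '*.tex' restricts the recursive search to .tex files;
# --            stops grep from parsing the pattern "->-" as an option;
# {} +          passes many file names to each grep invocation.
find "$demo_dir" -type f -name '*.tex' \
  -exec grep -H -- '->-' {} + > "$demo_dir.out"

cat "$demo_dir.out"
```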

Grep first 50 lines of files for pattern

awk (assuming your implementation supports the nextfile statement) can do this quite nicely:

awk 'FNR > 50 { nextfile }; /foobar/ { print FILENAME ": " $0 }' ./*.sql

The first statement skips to the next file once 50 records have been processed. The second statement prints the filename and the matching line for any line containing foobar.

If your awk doesn't have nextfile this variant works too, although I imagine it will be less efficient:

awk 'FNR <= 50 && /foobar/ { print FILENAME ": " $0 }' ./*.sql
