I need to run `grep` on a couple of million files. To speed it up, I tried the two approaches mentioned here: `xargs -P -n` and GNU `parallel`. I ran them on a subset of my files (9,026 of them), with the following results:
- With `xargs -P 8 -n 1000`, very fast:

        $ time find tex -maxdepth 1 -name "*.json" | \
            xargs -P 8 -n 1000 grep -ohP "'pattern'" > /dev/null

        real    0m0.085s
        user    0m0.333s
        sys     0m0.058s
- With `parallel`, very slow:

        $ time find tex -maxdepth 1 -name "*.json" | \
            parallel -j 8 grep -ohP "'pattern'" > /dev/null

        real    0m21.566s
        user    0m22.021s
        sys     0m18.505s
- Even sequential `xargs` is faster than `parallel`:

        $ time find tex -maxdepth 1 -name "*.json" | \
            xargs grep -ohP 'pattern' > /dev/null

        real    0m0.242s
        user    0m0.209s
        sys     0m0.040s
`xargs -P n` does not work for me because the output from all the processes gets interleaved, which does not happen with `parallel`. So I would like to use `parallel` without incurring this huge slowdown.

Any ideas?
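For what it's worth, the interleaving difference can be reproduced with a contrived example (a minimal sketch, not taken from the question; the jobs just echo two lines each): `xargs -P` lets concurrent processes write to stdout at the same time, so their lines can end up mixed, while GNU `parallel` buffers each job's output and prints it as a block (its default `--group` behaviour).

```sh
# Contrived demo of output grouping (hypothetical jobs, not the real grep run).
# With xargs -P, the four jobs write concurrently, so their lines MAY interleave
# (whether they actually do depends on timing and stdio buffering):
printf '%s\n' 1 2 3 4 | xargs -P 4 -n 1 sh -c 'echo "job $0 line 1"; echo "job $0 line 2"'

# GNU parallel buffers each job's output (--group is the default), so the two
# lines of any given job always appear together, never interleaved with others:
printf '%s\n' 1 2 3 4 | parallel -j 4 'echo "job {} line 1"; echo "job {} line 2"'
```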
UPDATE
- Following the answer by Ole Tange, I tried `parallel -X`; the results are here, for completeness:

        $ time find tex -maxdepth 1 -name "*.json" | \
            parallel -X -j 8 grep -ohP "'pattern'" > /dev/null

        real    0m0.563s
        user    0m0.583s
        sys     0m0.110s
- Fastest solution: following the comment by @cas, I tried `grep` with the `-H` option (to force printing the filenames) and then sorting. Results here:

        $ time find tex -maxdepth 1 -name '*.json' -print0 | \
            xargs -0r -P 9 -n 500 grep --line-buffered -oHP 'pattern' | \
            sort -t: -k1 | cut -d: -f2- > /dev/null

        real    0m0.144s
        user    0m0.417s
        sys     0m0.095s
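The reason this keeps the output usable is that `-H` makes every match carry a `filename:` prefix, `sort -t: -k1` regroups lines belonging to the same file after the concurrent greps have finished, and `cut -d: -f2-` strips the prefix again. A minimal sketch on two made-up sample files (names and contents are invented for illustration):

```sh
# Hypothetical sample files, only to show the shape of the pipeline's data:
printf 'xx pattern yy\n' > a.json
printf 'zz pattern ww\n' > b.json

# -o prints only the matched text, -H prefixes it with the file name:
grep -oH 'pattern' a.json b.json
#   a.json:pattern
#   b.json:pattern

# Sort on the first ':'-separated field to group lines per file,
# then cut the 'filename:' prefix back off:
grep -oH 'pattern' a.json b.json | sort -t: -k1 | cut -d: -f2-
#   pattern
#   pattern
```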
Best Answer
Try `parallel -X`. As written in the comments, the overhead of starting a new shell and opening files for buffering for each argument is probably the cause.

Be aware that GNU Parallel will never be as fast as `xargs` because of that. Expect an overhead of 10 ms per job. With `-X` this overhead is less significant, as you process more arguments in one job.