Spreading stdin to parallel processes

Tags: parallelism, split, xargs

I have a task that processes a list of files on stdin. The program's start-up time is substantial, and the amount of time each file takes varies widely. I want to spawn a substantial number of these processes, then dispatch work to whichever ones are not busy. Several command-line tools almost do what I want; I've narrowed it down to two nearly working options:

find . -type f | split -n r/24 -u --filter="myjob"
find . -type f | parallel --pipe -u -l 1 myjob

The problem is that split does a pure round-robin, so if one of the processes falls behind it stays behind, delaying completion of the entire run; parallel, on the other hand, wants to spawn one process per N lines or bytes of input, so I end up paying the start-up overhead far too many times.
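To illustrate the fixed assignment with a toy stand-in for myjob (each filter invocation prints its PID and the lines it was handed):

seq 6 | split -n r/3 -u --filter='echo "$$ got: $(cat)"'

Here one worker always gets lines 1 and 4, another gets 2 and 5, and the third gets 3 and 6, no matter how slowly any of them reads.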

Is there something like this that will re-use the processes and feed lines to whichever processes have unblocked stdins?

Best Answer

For GNU Parallel you can set the block size using --block. It does, however, require that you have enough memory to keep one block in memory for each of the running processes.
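A sketch of that trade-off, assuming the 24 workers from your split example (the 10M block size is just an illustrative value):

find . -type f | parallel --pipe -j 24 --block 10M myjob

Larger blocks amortize myjob's start-up cost over more lines, but with -j 24 and --block 10M up to roughly 240M of input can sit in memory at once, and load balancing then happens only at block granularity.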

I understand this is not precisely what you are looking for, but it may be an acceptable work-around for now.

If your tasks take roughly the same time on average, then you might be able to use mbuffer: each worker gets an input buffer that absorbs lines while the worker is busy, so split's unbuffered round-robin writes are not stalled by a temporarily slow process:

find . -type f | split -n r/24 -u --filter="mbuffer -m 2G | myjob"
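Note that with 24 workers and -m 2G each, the buffers alone can hold up to 48G of pending input, so scale -m to the amount of unevenness you expect between workers.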