Bash – Split File and Run Each Piece as Param to Script in Parallel

bash, gnu-parallel, parallelism, scripting, split

I have a file words.txt with 10,000 words (one per line) and 5,000 documents. I want to see which documents contain which of those words (with a regex pattern around each word). I have a script.sh that greps the documents and outputs the hits. I want to (1) split my input file into smaller files, (2) feed each of those files to script.sh as a parameter, and (3) run all of this in parallel.

My attempt, based on the tutorial, hits an error:

$ parallel ./script.sh ::: split words.txt
./script.sh: line 22: split: No such file or directory

My script.sh looks like this

#!/usr/bin/env bash

while read line      # line 1
do                   # line 2
    # some stuff
done < "$1"          # line 22
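
For illustration, the body of such a script might look like the sketch below; the docs/ directory and the word-boundary pattern are assumptions, not part of the original script:

#!/usr/bin/env bash
# Hypothetical sketch: print the name of every document in docs/
# that matches each word (docs/ and the \b...\b pattern are assumptions).
while IFS= read -r word
do
    grep -lE "\b${word}\b" docs/*
done < "$1"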

I guess I could write split's output to a directory and loop through the files in that directory, launching grep commands, but how can I do this elegantly and concisely (using parallel)?

Best Answer

The error happens because ::: passes the literal strings split and words.txt to ./script.sh as arguments, so the script tries to read a file named split on line 22. Instead, run split first and then hand the resulting files to parallel. You can use the split tool:

split -l 1000 words.txt words-

will split words.txt into files of at most 1000 lines each, named:

words-aa
words-ab
words-ac
...
words-ba
words-bb
...

If you omit the prefix (words- in the above example), split uses x as the default prefix.
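
For example, a quick check of the default:

split -l 1000 words.txt

produces xaa, xab, xac, and so on in the current directory.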

For using the generated files with parallel you can make use of a glob:

split -l 1000 words.txt words-
parallel ./script.sh ::: words-[a-z][a-z]
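
If your GNU parallel is recent enough to support --pipe and --cat (an assumption about your installed version, so check parallel --version first), you can skip the intermediate files entirely:

cat words.txt | parallel --pipe -N1000 --cat ./script.sh {}

Here --pipe -N1000 hands each job a chunk of 1000 lines, and --cat writes that chunk to a temporary file whose name is substituted for {}, so script.sh still receives a file path as its first argument; the temporary file is removed when the job finishes.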