Bash – Split File and Run Each Piece as Param to Script in Parallel

bash, gnu-parallel, parallelism, scripting, split

I have a file words.txt with 10,000 words (one per line) and 5,000 documents. I want to see which documents contain which of those words (with a regex pattern around each word). I have a script.sh that greps the documents and outputs the hits. I want to (1) split my input file into smaller files, (2) feed each of those files to script.sh as a parameter, and (3) run all of this in parallel.

My attempt, based on the tutorial, hits an error:

$ parallel ./script.sh ::: split words.txt
./script.sh: line 22: split: No such file or directory

My script.sh looks like this

#!/usr/bin/env bash

while read line      # line 1
do                   # line 2
    # some stuff
done < "$1"          # line 22
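
For illustration, the body of such a script might look like the sketch below; the docs/ directory and the word-boundary pattern are assumptions, not part of the original script:

#!/usr/bin/env bash
# Hypothetical sketch: print the name of every document in docs/
# that matches each word (docs/ and the \b...\b pattern are assumptions).
while IFS= read -r word
do
    grep -lE "\b${word}\b" docs/*
done < "$1"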

I guess I could write split's output to a directory and loop through the files in that directory, launching grep commands, but how can I do this elegantly and concisely (using parallel)?

Best Answer

The error happens because ::: passes the literal strings split and words.txt to ./script.sh as arguments, so the script tries to read a file named split on line 22. Instead, run split first and then hand the resulting files to parallel. You can use the split tool:

split -l 1000 words.txt words-

will split words.txt into files of at most 1000 lines each, named:

words-aa
words-ab
words-ac
...
words-ba
words-bb
...

If you omit the prefix (words- in the above example), split uses x as the default prefix.
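
For example, a quick check of the default:

split -l 1000 words.txt

produces xaa, xab, xac, and so on in the current directory.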

For using the generated files with parallel you can make use of a glob:

split -l 1000 words.txt words-
parallel ./script.sh ::: words-[a-z][a-z]
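
If your GNU parallel is recent enough to support --pipe and --cat (an assumption about your installed version, so check parallel --version first), you can skip the intermediate files entirely:

cat words.txt | parallel --pipe -N1000 --cat ./script.sh {}

Here --pipe -N1000 hands each job a chunk of 1000 lines, and --cat writes that chunk to a temporary file whose name is substituted for {}, so script.sh still receives a file path as its first argument; the temporary file is removed when the job finishes.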