Split text file into lines with fixed number of words

awk, sed, split, text processing

Related, but with no satisfactory answers: How can I split a large text file into chunks of 500 words or so?

I'm trying to take a text file (http://mattmahoney.net/dc/text8.zip) with more than 10^7 words, all on one line, and split it into lines of N words each. My current approach (a shell script) works, but is fairly slow and ugly:

i=0
# sed turns every run of whitespace into a newline, so the loop sees one word at a time
for word in $(sed -e 's/\s\+/\n/g' input.txt)
do
    echo -n "${word} " >> output.txt    # append; '>' would truncate the file on every word
    let "i=i+1"

    if [ "$i" -eq "1000" ]
    then
        echo >> output.txt              # end the line after 1000 words
        let "i=0"
    fi
done

Any tips on how I can make this faster or more compact?

Best Answer

Assuming your definition of a word is a sequence of non-blank characters separated by blanks, here's an awk solution for your single-line file:

awk '{for (i = 1; i <= NF; ++i) printf "%s%s", $i, (i % 500 ? " " : "\n")} NF % 500 {print ""}' file
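
Two details worth noting: the ternary prints a space after every word except each 500th, which gets a newline instead, and the trailing NF % 500 pattern supplies a final newline when the word count isn't an exact multiple of 500. Replace 500 with whatever N you need (1000 in your script). As a hypothetical illustration with N reduced to 5 and a made-up seven-word sample.txt:

$ printf 'one two three four five six seven' > sample.txt
$ awk '{for (i = 1; i <= NF; ++i) printf "%s%s", $i, (i % 5 ? " " : "\n")} NF % 5 {print ""}' sample.txt
one two three four five
six seven

(The final partial line keeps a trailing space, since the separator is printed before the word count is checked.)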
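
If you'd rather not use awk, here is a compact pipeline sketch built only from tr and xargs. It assumes the input contains no quote or backslash characters (true for the text8 corpus), because xargs gives those special treatment:

tr -s ' ' '\n' < input.txt | xargs -n 500 > output.txt

tr converts each run of spaces into a single newline, and xargs -n 500 echoes the words back 500 to a line. It forks one echo process per output line, so it's slower than the awk one-liner but still far faster than the original word-by-word shell loop.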