Split text file into lines with fixed number of words

awk, sed, split, text processing

Related, but with no satisfactory answers: How can I split a large text file into chunks of 500 words or so?

I'm trying to take a text file (http://mattmahoney.net/dc/text8.zip) with more than 10^7 words, all on one line, and split it into lines of N words each. My current approach (a shell script) works, but is fairly slow and ugly:

i=0
# sed turns every run of whitespace into a newline, so the loop sees one word at a time
for word in $(sed -e 's/\s\+/\n/g' input.txt)
do
    echo -n "${word} " >> output.txt    # append; '>' would truncate the file on every word
    let "i=i+1"

    if [ "$i" -eq "1000" ]
    then
        echo >> output.txt              # end the line after 1000 words
        let "i=0"
    fi
done

Any tips on how I can make this faster or more compact?

Best Answer

Assuming your definition of a word is a sequence of non-blank characters separated by blanks, here's an awk solution for your single-line file:

awk '{for (i = 1; i <= NF; ++i) printf "%s%s", $i, (i % 500 ? " " : "\n")} NF % 500 {print ""}' file
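
Two details worth noting: the ternary prints a space after every word except each 500th, which gets a newline instead, and the trailing NF % 500 pattern supplies a final newline when the word count isn't an exact multiple of 500. Replace 500 with whatever N you need (1000 in your script). As a hypothetical illustration with N reduced to 5 and a made-up seven-word sample.txt:

$ printf 'one two three four five six seven' > sample.txt
$ awk '{for (i = 1; i <= NF; ++i) printf "%s%s", $i, (i % 5 ? " " : "\n")} NF % 5 {print ""}' sample.txt
one two three four five
six seven

(The final partial line keeps a trailing space, since the separator is printed before the word count is checked.)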
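
If you'd rather not use awk, here is a compact pipeline sketch built only from tr and xargs. It assumes the input contains no quote or backslash characters (true for the text8 corpus), because xargs gives those special treatment:

tr -s ' ' '\n' < input.txt | xargs -n 500 > output.txt

tr converts each run of spaces into a single newline, and xargs -n 500 echoes the words back 500 to a line. It forks one echo process per output line, so it's slower than the awk one-liner but still far faster than the original word-by-word shell loop.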