Linux – Split large files into a number of files in unix

awk, files, linux, split

I have a file with 9074 lines. I want to split it into 10 files that each contain the same number of lines, except the last file, which should also contain the remaining lines.

split -l `wc -l myfile | awk '{print $1/10}'` myfile
split: invalid number of lines: ‘907.4’

I want the last file to contain 907 + 4 lines.

Best Answer

The -l parameter of split needs an integer, so you have to use integer division to compute it. Shell arithmetic handles integer division just fine:

lines_number=$(wc -l < file)
split -l $((lines_number / 10)) file
wc -l x*
  907 xaa
  907 xab
  907 xac
  907 xad
  907 xae
  907 xaf
  907 xag
  907 xah
  907 xai
  907 xaj
    4 xak
 9074 total
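The arithmetic behind that listing can be sketched on its own. This is only an illustration with the line count hard-coded; the variable names parts, quotient and remainder are mine, not part of the commands above.

```shell
# Integer division and remainder for the 9074-line example.
lines_number=9074
parts=10
quotient=$((lines_number / parts))    # the fraction is discarded
remainder=$((lines_number % parts))   # lines left over for the extra file
echo "lines per file: $quotient"
echo "leftover lines: $remainder"
echo "last file after merging: $((quotient + remainder))"
```

The leftover lines are what end up in the eleventh file, xak.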

If you want to use awk for this, you have to print an integer:

wc -l file | awk '{print int($1/10)}'
907

Integer division leaves an eleventh file holding the 4 leftover lines, so you will have to concatenate the last two files. Assuming that split wrote all of them into the same, otherwise empty directory, you can do:

printf "%s\n" x* | tail -n2 | xargs cat > last_file

wc -l < last_file
911

This works because the glob expands to the new files in alphabetical order, and split names its output files in that same order.
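If you want to check that ordering yourself, one possible sketch is to read the pieces back through the glob and compare the result with the input. The temporary directory and the generated sample file are assumptions for this demo only.

```shell
# Sketch: pieces read back in glob order should rebuild the input exactly.
tmpdir=$(mktemp -d)                        # scratch directory (assumption)
seq 9074 > "$tmpdir/file"                  # sample input with 9074 lines
split -l 907 "$tmpdir/file" "$tmpdir/x"    # writes xaa, xab, ... into tmpdir

# cmp -s is silent and only sets the exit status; "-" means standard input.
if cat "$tmpdir"/x* | cmp -s - "$tmpdir/file"; then
    result="identical"
else
    result="different"
fi
echo "$result"

# Count the pieces through the same glob.
set -- "$tmpdir"/x*
piece_count=$#
echo "$piece_count pieces"
```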

Note: I also prefer to use a custom prefix and numeric suffixes for the output files of split, like this:

split -d -l 907 file new_file

Note: The default suffix length is 2 (see man split and the -a option), so split names the files new_file00, new_file01, and so on. These sort alphabetically in the right order as long as there are fewer than 100 of them, that is, as long as no three-digit suffix is needed. To keep numeric suffixes (for human readability) in alphabetical order with more files, set the suffix length with -a to an appropriate value.
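As a rough illustration of the suffix length option, here is the same split with -a 3. It assumes GNU split (the -d option is not available everywhere), and the temporary directory and sample file exist only for this demo.

```shell
# Sketch: numeric suffixes that are three digits wide (GNU split assumed).
tmpdir=$(mktemp -d)
seq 9074 > "$tmpdir/file"
split -d -a 3 -l 907 "$tmpdir/file" "$tmpdir/new_file"

# The glob is sorted, so the first and last names show the suffix range.
first_piece=$(printf '%s\n' "$tmpdir"/new_file* | head -n 1)
last_piece=$(printf '%s\n' "$tmpdir"/new_file* | tail -n 1)
basename "$first_piece"    # new_file000
basename "$last_piece"     # new_file010
```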

The whole process fits into a small script. It also checks whether the remainder is zero; in that case no output files are modified.

#!/bin/bash
f="file"
prefix="new_file"

lines_number=$(wc -l < "$f")
split -d -l $((lines_number / 10)) "$f" "$prefix"

if ((lines_number % 10 != 0)); then
    last_file=$(printf "%s\n" "$prefix"* | tail -1)
    pre_last_file=$(printf "%s\n" "$prefix"* | tail -2 | head -1)
    cat "$last_file" >> "$pre_last_file" && rm -- "$last_file"
fi
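A possible variation, not the method used in the script above, avoids the merge step: split only the leading lines into equal pieces and send everything that follows straight into the last piece. The function name split_into_parts and its argument order are hypothetical. The sketch assumes GNU split and the default two-digit suffix, so it only accepts 2 to 100 parts.

```shell
# split_into_parts FILE PREFIX PARTS   (hypothetical helper, GNU split assumed)
split_into_parts() {
    local f="$1"
    local prefix="$2"
    local parts="$3"
    local lines_number chunk head_lines
    lines_number=$(wc -l < "$f")
    chunk=$((lines_number / parts))
    if [ "$parts" -lt 2 ] || [ "$parts" -gt 100 ] || [ "$chunk" -eq 0 ]; then
        echo "need 2..100 parts and at least one line per part" >&2
        return 1
    fi
    head_lines=$((chunk * (parts - 1)))
    # The first parts-1 pieces get exactly chunk lines each.
    head -n "$head_lines" "$f" | split -d -l "$chunk" - "$prefix"
    # Everything after them (chunk lines plus the remainder) is the last piece.
    tail -n "+$((head_lines + 1))" "$f" > "$(printf '%s%02d' "$prefix" $((parts - 1)))"
}

# Demo on generated data; the temporary directory is an assumption.
tmpdir=$(mktemp -d)
seq 9074 > "$tmpdir/file"
split_into_parts "$tmpdir/file" "$tmpdir/new_file" 10
wc -l < "$tmpdir/new_file09"
```

Because the last piece is written directly, this also behaves sensibly for small inputs where the remainder is larger than one chunk.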

Related Solutions

Split Large File into Chunks Without Splitting Entry

Here's a solution that could work:

seq 1 $(((lines=$(wc -l </tmp/file))/16+1)) $lines |
sed 'N;s|\(.*\)\(\n\)\(.*\)|\1d;\1,\3w /tmp/uptoline\3\2\3|;P;$d;D' |
sed -ne :nl -ne '/\n$/!{N;bnl}' -nf - /tmp/file

It works by letting the first sed write the second sed's script, which tells the second sed where to write its output. The second sed gathers input lines until it encounters a blank line, and then writes the gathered lines to the current output file. In my test case the generated script looked like this:

1d;1,377w /tmp/uptoline377
377d;377,753w /tmp/uptoline753
753d;753,1129w /tmp/uptoline1129
1129d;1129,1505w /tmp/uptoline1505
1505d;1505,1881w /tmp/uptoline1881
1881d;1881,2257w /tmp/uptoline2257
2257d;2257,2633w /tmp/uptoline2633
2633d;2633,3009w /tmp/uptoline3009
3009d;3009,3385w /tmp/uptoline3385
3385d;3385,3761w /tmp/uptoline3761
3761d;3761,4137w /tmp/uptoline4137
4137d;4137,4513w /tmp/uptoline4513
4513d;4513,4889w /tmp/uptoline4889
4889d;4889,5265w /tmp/uptoline5265
5265d;5265,5641w /tmp/uptoline5641
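To see how the first sed builds such a script, it may help to run just that stage on a short sequence. This sketch reuses the same sed command unchanged; the sequence 1, 4, 7, 10 is an arbitrary stand-in for the real line numbers.

```shell
# Sketch: the first stage alone, fed the sequence 1 4 7 10.
# Each pair of neighbouring numbers A, B becomes "Ad;A,Bw /tmp/uptolineB".
generated=$(seq 1 3 10 |
    sed 'N;s|\(.*\)\(\n\)\(.*\)|\1d;\1,\3w /tmp/uptoline\3\2\3|;P;$d;D')
printf '%s\n' "$generated"
# 1d;1,4w /tmp/uptoline4
# 4d;4,7w /tmp/uptoline7
# 7d;7,10w /tmp/uptoline10
```

N appends the next number, the substitution rewrites the pair, P prints the first line of the result, and D drops that line so the second number starts the next pair.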

I tested it like this:

printf '%s\nand\nmore\nlines\nhere\n\n' $(seq 1000) >/tmp/file

This gave me a file of 6000 lines, which looked like this:

<iteration#>
and
more
lines
here
#blank

...repeated 1000 times.
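The test file relies on printf reusing its format string once per argument. A smaller sketch of the same idea, with three entries instead of 1000 and a temporary file of my own choosing:

```shell
# Sketch: printf repeats the format for every argument, so each number
# produces one six-line entry that ends in a blank line.
tmpfile=$(mktemp)
printf '%s\nand\nmore\nlines\nhere\n\n' $(seq 3) > "$tmpfile"
line_count=$(($(wc -l < "$tmpfile")))      # arithmetic strips any padding
blank_count=$(grep -c '^$' "$tmpfile")
echo "$line_count lines, $blank_count blank"
```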

After running the script above:

set -- /tmp/uptoline*
echo $# total splitfiles
for splitfile do
    echo $splitfile
    wc -l <$splitfile
    tail -n6 $splitfile
done

OUTPUT

15 total splitfiles
/tmp/uptoline1129
378
188
and
more
lines
here

/tmp/uptoline1505
372
250
and
more
lines
here

/tmp/uptoline1881
378
313
and
more
lines
here

/tmp/uptoline2257
378
376
and
more
lines
here

/tmp/uptoline2633
372
438
and
more
lines
here

/tmp/uptoline3009
378
501
and
more
lines
here

/tmp/uptoline3385
378
564
and
more
lines
here

/tmp/uptoline3761
372
626
and
more
lines
here

/tmp/uptoline377
372
62
and
more
lines
here

/tmp/uptoline4137
378
689
and
more
lines
here

/tmp/uptoline4513
378
752
and
more
lines
here

/tmp/uptoline4889
372
814
and
more
lines
here

/tmp/uptoline5265
378
877
and
more
lines
here

/tmp/uptoline5641
378
940
and
more
lines
here

/tmp/uptoline753
378
125
and
more
lines
here

Text Processing – Split File at a Pattern

With awk you can do:

awk '{print >out}; /XYZ/{out="file2"}' out=file1 largefile

Explanation: The first awk argument (out=file1) defines a variable holding the name of the output file that is used while the subsequent argument (largefile) is processed. The awk program prints every line to the file named by the variable out ({print >out}). When the pattern XYZ is found, the variable is redefined to point to the new file ({out="file2"}), which becomes the target for all subsequent lines. Note that the line matching XYZ itself still goes to file1, because it is printed before the variable changes.
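A small, self-contained run may make this concrete. The sample lines and the temporary directory are made up for the demo, and the second file name is passed in through a variable I call next_out so that both outputs land in that directory.

```shell
# Sketch: split a four-line sample at the line containing XYZ.
tmpdir=$(mktemp -d)
printf '%s\n' 'first' 'marker XYZ' 'second' 'third' > "$tmpdir/largefile"

# Same program as above, with the second file name supplied as a variable.
awk '{print > out}; /XYZ/{out = next_out}' \
    next_out="$tmpdir/file2" out="$tmpdir/file1" "$tmpdir/largefile"

cat "$tmpdir/file1"    # first, marker XYZ
cat "$tmpdir/file2"    # second, third
```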

References:

gawk manual: Redirection http://www.gnu.org/software/gawk/manual/html_node/Redirection.html#Redirection
