Linux – Split large files into a number of files in unix

awk, files, linux, split

I have a file that contains 9074 lines. I want to split it into 10 files that each contain the same number of lines, except the last file, which should also contain the remaining lines.

split -l `wc -l myfile | awk '{print $1/10}'` myfile
split: invalid number of lines: ‘907.4’

I want the last file to contain 907+4 lines.

Best Answer

You need integer division to get the value for the -l parameter of split. The shell handles integer division just fine:

lines_number=$(wc -l < file)
split -l $((lines_number / 10)) file
wc -l x*
  907 xaa
  907 xab
  907 xac
  907 xad
  907 xae
  907 xaf
  907 xag
  907 xah
  907 xai
  907 xaj
    4 xak
 9074 total
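
To try this without a real input file, here is a minimal sketch. It assumes seq and mktemp are available, and the scratch directory and the names quotient, remainder and chunks are only for illustration. It shows the arithmetic behind the listing above: the quotient becomes the chunk size, and the remainder ends up in one extra file.

```shell
# Work in a scratch directory so the x* glob only sees split's output.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

seq 9074 > file                   # stand-in for the 9074-line input

lines_number=$(wc -l < file)
quotient=$((lines_number / 10))   # integer division, the fraction is discarded
remainder=$((lines_number % 10))  # lines left over for the extra file

split -l "$quotient" file         # default prefix "x": xaa, xab, ...
chunks=$(printf '%s\n' x* | wc -l)

echo "quotient=$quotient remainder=$remainder chunks=$((chunks))"
# prints: quotient=907 remainder=4 chunks=11
```

So a non-zero remainder gives 11 files instead of 10, which is why the last two have to be merged afterwards.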

If you want to use awk for this, you have to print an integer:

wc -l file | awk '{print int($1/10)}'
907

You will then have to concatenate the last two files. Assuming all of the output files were written into the same, otherwise empty directory, you can do:

printf "%s\n" x* | tail -n2 | xargs cat > last_file

wc -l < last_file
911

The above relies on two things: the glob expands to the new files in alphabetical order, and split names its output files in that same order.
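
As a quick check of that ordering assumption, the following sketch rebuilds the example in a scratch directory (the directory and the seq-generated input are assumptions for illustration) and looks at what ended up in last_file:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 9074 > file                   # line N contains the number N
split -l 907 file                 # produces xaa ... xak

# The glob expands in sorted order, so the last two names are xaj and xak.
printf '%s\n' x* | tail -n 2 | xargs cat > last_file

merged_lines=$(wc -l < last_file)
first_line=$(head -n 1 last_file)
final_line=$(tail -n 1 last_file)
echo "lines=$((merged_lines)) first=$first_line last=$final_line"
# prints: lines=911 first=8164 last=9074
```

The first nine files hold lines 1 to 8163, so last_file starting at 8164 and ending at 9074 confirms that the two files were picked and joined in the right order. Note that xaj and xak are still present afterwards; this command only creates the merged copy.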


Note: I also prefer to use a custom prefix and numeric suffixes for the output files of split, like this:

split -d -l 907 file new_file

Note: The default suffix length is 2 (see man split and the -a option), so split names the files new_file00, new_file01, and so on. These still sort alphabetically as long as there are fewer than 100 of them; beyond that, a 3-digit suffix is needed. To keep both numeric suffixes (for human readability) and alphabetical order with more files, set the suffix length with -a to an appropriate value.
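
For illustration, here is a sketch with a 3-character suffix. It assumes GNU split (the -d option is a GNU extension) and again uses a scratch directory with seq-generated input:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 9074 > file

# -d: numeric suffixes (GNU split); -a 3: suffixes are three characters long.
split -d -a 3 -l 907 file new_file

# Lists new_file000 through new_file010, in that order.
printf '%s\n' new_file*
```

With three digits the names keep sorting correctly up to 1000 output files.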


To wrap the whole process in a small script, you can do the following. It also checks whether the remainder is zero; in that case, no output files are modified.

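#!/bin/bash
f="file"            # input file
prefix="new_file"   # prefix for the output chunks

lines_number=$(wc -l < "$f")
chunk=$((lines_number / 10))

# split rejects -l 0, so stop if the file has fewer than 10 lines
if ((chunk == 0)); then
    echo "$f has fewer than 10 lines" >&2
    exit 1
fi

split -d -l "$chunk" "$f" "$prefix"

# A non-zero remainder leaves one extra short file: append it to the file before it.
# This assumes the remainder is smaller than the chunk size (only one extra file).
if ((lines_number % 10 != 0)); then
    last_file=$(printf "%s\n" "$prefix"* | tail -n 1)
    pre_last_file=$(printf "%s\n" "$prefix"* | tail -n 2 | head -n 1)
    cat "$last_file" >> "$pre_last_file" && rm -- "$last_file"
fi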
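
Since the question is also tagged awk, here is an alternative sketch that writes the remainder straight into the last file in a single pass, so no merge step is needed. This is not part of the answer above: split_even is a hypothetical helper, and the part_ prefix, the scratch directory and the seq-generated input are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical helper: split_even FILE PARTS PREFIX
# Writes PREFIX00, PREFIX01, ... with total/PARTS lines each;
# the last part also receives the leftover lines.
split_even() {
    local f=$1 parts=$2 prefix=$3
    local total per
    total=$(wc -l < "$f")
    per=$((total / parts))
    if ((per == 0)); then
        echo "split_even: $f has fewer than $parts lines" >&2
        return 1
    fi
    awk -v per="$per" -v parts="$parts" -v prefix="$prefix" '
        {
            idx = int((NR - 1) / per)           # 0-based index of the target part
            if (idx >= parts) idx = parts - 1   # leftover lines go to the last part
            out = sprintf("%s%02d", prefix, idx)
            print > out
        }' "$f"
}

# Demo in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 9074 > file
split_even file 10 part_
wc -l part_*
```

For the 9074-line example this gives part_00 to part_08 with 907 lines each and part_09 with 911. Unlike the merge approach, it also behaves when the remainder is larger than the chunk size. The two-digit suffix only keeps alphabetical order for up to 100 parts.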