Ubuntu – How to split a large file

bash, command line, regex

How can I pass all files in a list.txt which looks like

file1_1.txt
file1_2.txt
file1_3.txt
file1_4.txt

to the split command (there are other txt files in the dir) and make it return the parts numbered as in (assume file1_1.txt would be chunked into 4 parts)

file1_1_1.txt;
file1_1_2.txt;
file1_1_3.txt;
file1_1_4.txt 

and so on….

Best Answer

To be sure that no line is broken across chunks, you can use an option other than a fixed byte count.

split -n l/#number

splits a file into #number chunks of roughly equal size without breaking any line. (A plain split -n #number gives chunks of equal byte size, but it may cut a line in two.) Example:

split -n l/4 file.txt

This will split file.txt into 4 chunks.
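To see how -n treats line boundaries, here is a small check, assuming GNU split from coreutils. The sample file and the bytes. and lines. prefixes are made up for the illustration. As far as I know, -n 4 divides by byte count, while -n l/4 keeps every line whole.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical sample input: 10 short lines.
seq 1 10 | sed 's/$/ some text/' > file.txt

split -n 4 file.txt bytes.      # equal byte counts, may cut a line in two
split -n l/4 file.txt lines.    # four chunks, every line kept whole

# A chunk ends at a line boundary if its last byte is a newline.
# $(...) strips a trailing newline, so the result is empty in that case.
for f in bytes.* lines.*; do
    if [ -z "$(tail -c 1 "$f")" ]; then
        echo "$f: ends at a line boundary"
    else
        echo "$f: ends in the middle of a line"
    fi
done
```

Concatenating the lines.* pieces in order gives back the original file, so nothing is lost either way; only the cut points differ.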

OR you can split by number of lines

split -l #number-of-lines

example

split -l 200 file.txt

This will split file.txt into files of 200 lines each. The files are not guaranteed to all be the same length, though, since the last one may have fewer lines than the number chosen.
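A quick way to see the shorter last file, as a sketch with a made-up 450-line input and a made-up part. prefix:

```shell
# Work in a scratch directory so nothing in the current directory is touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

seq 1 450 > file.txt          # hypothetical 450-line input
split -l 200 file.txt part.   # creates part.aa, part.ab, part.ac

# Count the lines in each piece.
wc -l part.*
```

Here the first two pieces hold 200 lines each and the third holds the remaining 50.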

Now for the naming. By default, split uses the prefix "x" and the suffixes "aa", "ab", "ac", ...

To change that default, you have to give the prefix to use as the last argument.

In your case you can use

split -n #number file1_1.txt file1_1.

The output would be like file1_1.aa file1_1.ab file1_1.ac

Alternatively, you can make the suffixes numeric with -d:

  split -n #number -d file1_1.txt file1_1.

The output would be like file1_1.00 file1_1.01 file1_1.02
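The naming variants can be compared side by side in a scratch directory. This is only a sketch, assuming GNU split. The num. and one. prefixes are made up so the runs do not overwrite each other, and -a (suffix length) is an extra option not mentioned above.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

seq 1 90 > file1_1.txt                 # hypothetical sample input

split -n 3 file1_1.txt                 # default prefix "x": xaa xab xac
split -n 3 file1_1.txt file1_1.        # custom prefix: file1_1.aa file1_1.ab file1_1.ac
split -n 3 -d file1_1.txt num.         # numeric suffixes: num.00 num.01 num.02
split -n 3 -d -a 1 file1_1.txt one.    # suffix length 1: one.0 one.1 one.2

ls
```

The prefix is always the last argument, and split appends the suffix directly to it, which is why the prefixes here end in a dot.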

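So you can't get the exact naming you want (file1_1_1.txt, file1_1_2.txt, ...) from these basic options alone; you would have to rename the pieces afterwards, for example with some awk or sed and a regex.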
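If renaming afterwards is acceptable, a plain shell loop may be enough, with no awk or sed involved. This is a sketch of that swapped-in approach, not the only way to do it. The file1_1.part. prefix is a made-up intermediate name, and the sample input is invented.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

seq 1 100 > file1_1.txt                      # hypothetical sample input

# Split into 4 pieces named file1_1.part.00 ... file1_1.part.03
split -n l/4 -d file1_1.txt file1_1.part.

# Rename the pieces to file1_1_1.txt ... file1_1_4.txt.
# The glob expands in sorted order, so the numbering follows the split order.
i=1
for piece in file1_1.part.*; do
    mv -- "$piece" "file1_1_$i.txt"
    i=$((i + 1))
done

ls
```

The intermediate prefix is chosen so that the glob cannot match the original file1_1.txt.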

Now, to read the set of files from a file, let's call it files.list:

n=4   # number of chunks per file
while IFS= read -r file
do
    # pieces are named file1_1.txt00, file1_1.txt01, ...
    split -n "$n" -d "$file" "$file"
done < files.list
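Here is a self-contained variant of that loop that can be tried safely in a scratch directory. The sample files, the chunks variable, the skip on missing entries, and the "${file%.txt}." prefix (which drops the .txt before the suffix) are my own additions for illustration.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical sample data: three listed files, one unlisted, one missing.
for n in 1 2 3; do seq 1 40 > "file1_$n.txt"; done
seq 1 40 > other.txt                      # not in the list, must stay untouched
printf '%s\n' file1_1.txt file1_2.txt file1_3.txt missing.txt > files.list

chunks=4
while IFS= read -r file; do
    # Skip list entries that do not exist instead of letting split fail.
    [ -f "$file" ] || { echo "skipping $file" >&2; continue; }
    # file1_1.txt -> file1_1.00 file1_1.01 file1_1.02 file1_1.03
    split -n "l/$chunks" -d "$file" "${file%.txt}."
done < files.list

ls
```

Only the files named in files.list are split; other.txt is left alone because the loop never sees it.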

UPDATE: I came across this question: https://unix.stackexchange.com/questions/32626/split-a-file-by-line-and-have-control-over-resulting-files-extension

According to the second answer:

With recent versions (≥ 8.16) of GNU split, one can use the --additional-suffix switch to have control over the resulting extension. From man split:

--additional-suffix=SUFFIX
       append an additional SUFFIX to file names

So when using that option:

split -dl 10000 --additional-suffix=.txt words wrd

the resulting pieces will automatically end in .txt:

wrd00.txt wrd01.txt

So in your case, if your split version is ≥ 8.16, you can do what you need like this:

split -n #number -d --additional-suffix=.txt file1_1.txt file1_1-

So output would be like

file1_1-00.txt file1_1-01.txt file1_1-02.txt .....
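To get the exact names from the question (file1_1_1.txt, file1_1_2.txt, ...), the following sketch may work. It assumes GNU split ≥ 8.16, where, as far as I know, --numeric-suffixes accepts a start value, and it uses -a 1 for one-digit suffixes. The sample files are made up; list.txt is the list from the question.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical sample data and list.
for n in 1 2; do seq 1 40 > "file1_$n.txt"; done
printf '%s\n' file1_1.txt file1_2.txt > list.txt

while IFS= read -r file; do
    base=${file%.txt}                     # file1_1.txt -> file1_1
    # Numbering starts at 1, suffix is one digit long, and .txt is appended:
    # file1_1_1.txt file1_1_2.txt file1_1_3.txt file1_1_4.txt
    split -n l/4 --numeric-suffixes=1 -a 1 --additional-suffix=.txt \
        "$file" "${base}_"
done < list.txt

ls
```

With -a 1 and numbering from 1, this only covers up to 9 chunks per file; for more, a longer suffix (for example -a 2) would be needed, which gives names like file1_1_01.txt.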