Split Large File into Chunks Without Splitting Entry

splittext processing

I have a rather large .msg file formatted in the UIEE format.

$ wc -l big_db.msg
8726593 big_db.msg

Essentially, the file is made up of entries of various length that look something like this:

UR|1
AA|Condon, Richard
TI|Prizzi's Family
CN|Collectable- Good/Good
MT|FICTION
PU|G.P. Putnam & Sons
DP|1986
ED|First Printing.
BD|Hard Cover
NT|0399132104
KE|MAFIA
KE|FICTION
PR|44.9
XA|4
XB|1
XC|BO
XD|S

UR|10
AA|Gariepy, Henry
TI|Portraits of Perseverance
CN|Good/No Jacket
MT|SOLD
PU|Victor Books
DP|1989
BD|Mass Market Paperback
NT|1989 tpb g 100 meditations from the Book of Job "This book...help you
NT| persevere through the struggles of your life..."
KE|Bible
KE|religion
KE|Job
KE|meditations
PR|28.4
XA|4
XB|5
XC|BO
XD|S

This is an examples of two entries, separated by a blank line. I wish to split this big file into smaller files without breaking an entry into two files.

Each individual entry is separated by a newline (a completely blank line) in the file. I wish to break this 8.7 million line file into 15 files. I understand that tools like split exist but I'm not quite sure how to split the file but only have it split on a newline so a single entry doesn't get broken into multiple files.

Best Answer

Here's a solution that could work:

seq 1 $(((lines=$(wc -l </tmp/file))/16+1)) $lines |
sed 'N;s|\(.*\)\(\n\)\(.*\)|\1d;\1,\3w /tmp/uptoline\3\2\3|;P;$d;D' |
sed -ne :nl -ne '/\n$/!{N;bnl}' -nf - /tmp/file

It works by allowing the first sed to write the second sed's script. The second sed first gathers all input lines until it encounters a blank line. It then writes all output lines to a file. The first sed writes out a script for the second one instructing it on where to write its output. In my test case that script looked like this:

1d;1,377w /tmp/uptoline377
377d;377,753w /tmp/uptoline753
753d;753,1129w /tmp/uptoline1129
1129d;1129,1505w /tmp/uptoline1505
1505d;1505,1881w /tmp/uptoline1881
1881d;1881,2257w /tmp/uptoline2257
2257d;2257,2633w /tmp/uptoline2633
2633d;2633,3009w /tmp/uptoline3009
3009d;3009,3385w /tmp/uptoline3385
3385d;3385,3761w /tmp/uptoline3761
3761d;3761,4137w /tmp/uptoline4137
4137d;4137,4513w /tmp/uptoline4513
4513d;4513,4889w /tmp/uptoline4889
4889d;4889,5265w /tmp/uptoline5265
5265d;5265,5641w /tmp/uptoline5641

I tested it like this:

printf '%s\nand\nmore\nlines\nhere\n\n' $(seq 1000) >/tmp/file

This provided me a file of 6000 lines, which looked like this:

<iteration#>
and
more
lines
here
#blank

...repeated 1000 times.

After running the script above:

set -- /tmp/uptoline*
echo $# total splitfiles
for splitfile do
    echo $splitfile
    wc -l <$splitfile
    tail -n6 $splitfile
done    

OUTPUT

15 total splitfiles
/tmp/uptoline1129
378
188
and
more
lines
here

/tmp/uptoline1505
372
250
and
more
lines
here

/tmp/uptoline1881
378
313
and
more
lines
here

/tmp/uptoline2257
378
376
and
more
lines
here

/tmp/uptoline2633
372
438
and
more
lines
here

/tmp/uptoline3009
378
501
and
more
lines
here

/tmp/uptoline3385
378
564
and
more
lines
here

/tmp/uptoline3761
372
626
and
more
lines
here

/tmp/uptoline377
372
62
and
more
lines
here

/tmp/uptoline4137
378
689
and
more
lines
here

/tmp/uptoline4513
378
752
and
more
lines
here

/tmp/uptoline4889
372
814
and
more
lines
here

/tmp/uptoline5265
378
877
and
more
lines
here

/tmp/uptoline5641
378
940
and
more
lines
here

/tmp/uptoline753
378
125
and
more
lines
here
Related Question