Splitting a file at every 10000 numbers (not lines)

awk split text-processing

I have a file that looks like the following:

chr19   61336212        +       0       0       CG      CGT    
chr19   61336213        -       0       0       CG      CGG    
chr19   61336218        +       0       0       CG      CGG    
chr19   61336219        -       0       0       CG      CGC    
chr19   61336268        +       0       0       CG      CGG    
chr19   61336269        -       0       0       CG      CGA    
chr19   61336402        +       0       0       CG      CGG    
chr19   61336403        -       0       0       CG      CGT

I want to split this file at every interval of 10000 in the 2nd field (NOT every 10000 lines, but every 10000 in value). So for this file I would like one piece running from the first line (the line with 61336212) up to the line with 61346211 (61336212+9999), or the last line before that value, then one from 61346212 to 61356211, and so on. As you can see, the numbers in the 2nd field are not contiguous.

Is there a way to do this?

Best Answer

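awk 'NR==1 {n=$2}   # n: 2nd field of the first line, start of interval 0
     {
       # interval index is ($2-n)/10000; %d drops the fractional part
       file = sprintf("file.%.4d", ($2-n)/10000)
       if (file != last_file) {
         close(last_file)   # finished with the previous interval
         last_file = file
       }
       print > file
     }'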

This writes to file.0000, file.0001, ... (the number being int(($2-n)/10000), where n is $2 on the first line). An interval that contains no lines produces no file.

Note that we close each file once we've stopped writing to it; otherwise you'd reach the limit on the number of simultaneously open files after a few hundred files (GNU awk can work around that limit, but performance then degrades quickly).

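We're assuming the numbers in the 2nd field are always increasing. If a later line fell back into an earlier interval, print > file would reopen that already-closed file and truncate it.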
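To see the bucketing on a small input, here is a minimal sketch. The sample values, the scratch directory and the input name sample.txt are made up for illustration; only the awk program comes from the answer.

```shell
# Hypothetical demo: made-up sample values in a scratch directory.
cd "$(mktemp -d)" || exit 1

# Three lines: the first two fall in 61336212..61346211 (interval 0),
# the third in interval 2. Nothing falls in interval 1, so file.0001
# is never created.
printf 'chr19\t%s\t+\n' 61336212 61346211 61356212 > sample.txt

awk 'NR==1 {n=$2}
     {
       file = sprintf("file.%.4d", ($2-n)/10000)
       if (file != last_file) {
         close(last_file)
         last_file = file
       }
       print > file
     }' sample.txt

# file.0000 holds two lines, file.0002 holds one.
wc -l file.*
```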

`join`

Assuming you have a data file you want to extract rows from and a line_numbers file that lists the numbers of the rows you want to extract, if the sorting order of the output is not important you can use:

join <(sort padded_line_numbers) <(nl -w 12 -n rz data) | cut -d ' ' -f 2-

This will number the lines of your data file, join it with the padded_line_numbers file on the first field (the default) and print out the common lines (excluding the join field itself, which is cut away).

join needs its input files to be sorted lexicographically on the join field. The padded_line_numbers file mentioned above has to be prepared by left-padding each line of your line_numbers file with zeros. E.g.:

while read rownum; do
    printf '%.12d\n' "$rownum"
done <line_numbers >padded_line_numbers

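The -w 12 -n rz options instruct nl to output 12-digit numbers with leading zeros. Note that nl numbers only non-empty lines by default; if data may contain empty lines, add -b a so that the numbers match the actual line numbers.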
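As a quick check of the padding and joining steps, here is a minimal sketch on a five-line file. The file contents, the scratch directory and the names sorted_padded and extracted are made up for illustration. It pipes nl into join through the - operand instead of using process substitution, so that it does not depend on bash, and it passes -b a to nl as an extra precaution.

```shell
# Hypothetical demo: tiny made-up inputs in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '%s\n' alpha beta gamma delta epsilon > data
printf '%s\n' 4 2 > line_numbers

# Left-pad every requested line number to 12 digits, then sort the list.
while read -r rownum; do
    printf '%.12d\n' "$rownum"
done <line_numbers | sort >sorted_padded

# Number data the same way (-b a also numbers empty lines), join on the
# number read from standard input (-), then cut the number away.
nl -b a -w 12 -n rz data | join sorted_padded - | cut -d ' ' -f 2- >extracted

cat extracted   # beta, then delta: ordered by line number, not by request
```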

If the sorting order of the output has to match that of your line_numbers file, you can use:

join -1 2 -2 1 <(nl padded_line_numbers | sort -k 2,2) \
    <(nl -w 12 -n rz data) |
    sort -k 2,2n |
    cut -d ' ' -f 3-

Here we number the padded_line_numbers file, sort the result lexicographically by its second field, join it with the numbered data file, and numerically sort the result by the added numbers to restore the original order of padded_line_numbers.
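The intermediate results make the field numbers in this pipeline easier to follow. The following is only a sketch with made-up inputs; the names requests, joined and extracted are hypothetical, and regular files replace process substitution so that each step can be inspected.

```shell
# Hypothetical demo: tiny made-up inputs in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '%s\n' alpha beta gamma delta epsilon > data
printf '%s\n' 000000000004 000000000002 > padded_line_numbers

# Step 1: record the position of each request, then sort by padded number.
nl padded_line_numbers | sort -k 2,2 > requests

# Step 2: join. Each output line is: padded number, position, data line.
nl -w 12 -n rz data | join -1 2 -2 1 requests - > joined
cat joined

# Step 3: sort by position to restore request order, keep only the data.
sort -k 2,2n joined | cut -d ' ' -f 3- > extracted
cat extracted   # delta, then beta: same order as the requests (4, 2)
```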

Process substitution is used here for convenience. If you cannot or do not want to rely on it and, as is likely, you are not willing to waste the storage needed for regular files holding intermediate results, you can use named pipes:

mkfifo padded_line_numbers
mkfifo numbered_data

while read rownum; do
    printf '%.12d\n' "$rownum"
done <line_numbers | nl | sort -k 2,2 >padded_line_numbers &

nl -w 12 -n rz data >numbered_data &

join -1 2 -2 1 padded_line_numbers numbered_data | sort -k 2,2n | cut -d ' ' -f 3-
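A self-cleaning variant of the named-pipe approach could look like the sketch below. The scratch directory, the trap and the tiny inputs are additions for illustration and are not part of the commands above; the FIFO names are kept.

```shell
# Hypothetical demo: made-up inputs. The FIFOs live in a scratch directory
# that is removed when the script exits.
workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

printf '%s\n' alpha beta gamma delta epsilon > data
printf '%s\n' 4 2 > line_numbers
mkfifo padded_line_numbers numbered_data

# The writers run in the background; each one blocks until join opens
# the reading end of its FIFO.
while read -r rownum; do
    printf '%.12d\n' "$rownum"
done <line_numbers | nl | sort -k 2,2 >padded_line_numbers &
nl -w 12 -n rz data >numbered_data &

result=$(join -1 2 -2 1 padded_line_numbers numbered_data |
    sort -k 2,2n | cut -d ' ' -f 3-)
wait    # reap the two background writers
printf '%s\n' "$result"   # delta, then beta
```

Keeping the FIFOs in a temporary directory avoids leaving them behind in the working directory, where a later run of mkfifo would fail because they already exist.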

Benchmarking

Since the peculiarity of your question is the number of rows in your data file, I thought it could be useful to test alternative approaches with a comparable amount of data.

For my tests I used a 3.2-billion-line data file. Each line is just 2 bytes of garbage coming from openssl enc, hex-encoded using od -An -tx1 -w2 and with spaces removed with tr -d ' ':

$ head -n 3 data
c15d
061d
5787

$ wc -l data
3221254963 data

The line_numbers file was created by randomly choosing 10,000 numbers between 1 and 3,221,254,963, without repetition, using shuf from GNU Coreutils:

shuf -i 1-"$(wc -l <data)" -n 10000 >line_numbers
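For anyone who wants to try the commands at a small scale first, test inputs of the same shape can be produced roughly as follows. This is a sketch under assumptions: /dev/urandom stands in for the openssl enc stream, whose exact invocation is not given here, the sizes are tiny, and -v is added to od so that repeated lines are not collapsed into "*". It relies on GNU od and shuf.

```shell
# Hypothetical small-scale reproduction in a scratch directory.
cd "$(mktemp -d)" || exit 1

# Assumption: /dev/urandom replaces the openssl enc stream.
# 20 random bytes give 10 lines of 4 hex digits each.
head -c 20 /dev/urandom | od -v -An -tx1 -w2 | tr -d ' ' > data

# Pick 3 distinct line numbers, the same way as for the 10,000 above.
shuf -i 1-"$(wc -l <data)" -n 3 > line_numbers

head -n 3 data
cat line_numbers
```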

The testing environment was a laptop with an Intel i7-2670QM quad-core processor, 16 GiB of memory, SSD storage, GNU/Linux, bash 5.0 and GNU tools.
The only dimension I measured was execution time, by means of the time shell builtin.

Here I'm considering:

The sed solution from Weijun Zhou's answer.
The awk solution from Micha's answer.
The perl solution from wurtel's answer.
The join solution above.

perl seems to be the fastest:

$ time perl_script line_numbers data | wc -l
10000

real    14m51.597s
user    14m41.878s
sys     0m9.299s

awk takes roughly twice as long, though it stays in the same order of magnitude:

$ time awk 'FNR==NR { seen[$0]++ }; FNR!=NR && FNR in seen' line_numbers data | wc -l
10000

real    29m3.808s
user    28m52.616s
sys     0m10.709s
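How the awk one-liner selects lines is easier to see on a tiny input. The data below is made up for illustration; the awk program is the one timed above.

```shell
# Hypothetical demo: tiny made-up inputs in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '%s\n' alpha beta gamma delta epsilon > data
printf '%s\n' 4 2 > line_numbers

# While reading the first file (FNR==NR), store each requested number as
# an array key. While reading the second file, print a line when its line
# number (FNR) is one of those keys.
awk 'FNR==NR { seen[$0]++ }; FNR!=NR && FNR in seen' line_numbers data > extracted

cat extracted   # beta, then delta: output follows the order of data
```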

join performs about the same as awk:

$ time join <(sort padded_line_numbers) <(nl -w 12 -n rz data) | wc -l
10000

real    28m24.053s
user    27m52.857s
sys     0m28.958s

Note that the order-preserving version mentioned above has almost no performance penalty over this one.

Finally, sed appears to be significantly slower. I killed it after approximately nine hours:

$ time sed -nf <(sed 's/$/p/' line_numbers) data | wc -l
^C

real    551m12.747s
user    550m53.390s
sys     0m15.624s
