Splitting a file every 10000 numbers (not lines)

awk split text-processing

I have a file that looks like the following:

chr19   61336212        +       0       0       CG      CGT    
chr19   61336213        -       0       0       CG      CGG    
chr19   61336218        +       0       0       CG      CGG    
chr19   61336219        -       0       0       CG      CGC    
chr19   61336268        +       0       0       CG      CGG    
chr19   61336269        -       0       0       CG      CGA    
chr19   61336402        +       0       0       CG      CGG    
chr19   61336403        -       0       0       CG      CGT    

I want to split this file at every interval of 10000 in the 2nd field (NOT every 10000 lines, but every 10000-wide range of values). So for this file, the first piece should run from the first line (the line with 61336212) up to and including 61346211 (61336212+9999), the next from 61346212 to 61356211, and so on. As you can see, the numbers in the 2nd field are not contiguous.

Is there a way to do this?

Best Answer

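awk 'NR==1 {n=$2}              # n: value of field 2 on the first line
     {
       # the %d conversion drops the fractional part of ($2-n)/10000
       file = sprintf("file.%.4d", ($2-n)/10000)
       if (file != last_file) {
         close(last_file)      # finished with the previous interval
         last_file = file
       }
       print > file
     }'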

This writes to file.0000, file.0001, and so on (the number being int(($2-n)/10000), where n is $2 on the first line).
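To make the interval arithmetic concrete, here is a small sketch that only prints which output file each line would be assigned to, without writing any files. The name sample.txt and its four rows are made up for illustration, and int() is written out explicitly to mirror the formula above.

```shell
# Hypothetical sample: positions chosen to sit on and around interval boundaries.
printf '%s\n' \
  'chr19 61336212 + 0 0 CG CGT' \
  'chr19 61346211 - 0 0 CG CGG' \
  'chr19 61346212 + 0 0 CG CGG' \
  'chr19 61366300 - 0 0 CG CGC' > sample.txt

# Show the target file name for each position instead of writing to it.
awk 'NR==1 {n=$2}
     {printf "%s -> file.%.4d\n", $2, int(($2-n)/10000)}' sample.txt
# 61336212 -> file.0000
# 61346211 -> file.0000
# 61346212 -> file.0001
# 61366300 -> file.0003
```

Note that no line falls in the third interval here, so file.0002 would simply never be created.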

Note that we close each file once we have stopped writing to it; otherwise, you would reach the limit on the number of simultaneously open files after a few hundred files (GNU awk can work around that limit, but performance then degrades quickly).

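We're assuming the numbers in the 2nd field are always increasing: if an earlier interval reappeared after its file had been closed, print > file would reopen that file and truncate it.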
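If the input is not already in ascending order, one possible workaround is to sort it numerically on the 2nd field before splitting. This is only a sketch: it assumes that reordering the lines is acceptable and that the intervals should start at the smallest value rather than at the value on the original first line. The file name unsorted.txt and its rows are hypothetical.

```shell
# Hypothetical unsorted input (file name and rows are illustrative only).
printf '%s\n' \
  'chr19 61346212 + 0 0 CG CGG' \
  'chr19 61336212 + 0 0 CG CGT' \
  'chr19 61346300 - 0 0 CG CGA' \
  'chr19 61336213 - 0 0 CG CGG' > unsorted.txt

# Sort numerically on field 2 so that each output file is opened only once,
# then apply the same splitting logic as in the answer.
sort -k2,2n unsorted.txt |
awk 'NR==1 {n=$2}
     {
       file = sprintf("file.%.4d", ($2-n)/10000)
       if (file != last_file) {
         close(last_file)
         last_file = file
       }
       print > file
     }'
```

With this sample, file.0000 receives the two lines starting at 61336212 and file.0001 receives the two lines starting at 61346212.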
