How to Split a File by Counting Digits Within a Row

Tags: cut, linux, shell, split, text-processing

I have a file with 45,000 characters in each line, and I want to split the original file based on specific character positions within a line. As a small example, my input file looks like:

input.txt :

123394531112334455938383726644600000111234499922281133
234442221117273747474747474729292921111098887777772235
231112233647474838389292121037549284753930837475111013

It has 54 digits in each line. I want the first 10 digits to go into a separate file, digits 11-24 into another file, digits 25-32 into another file, and digits 33-54 into the last file, like:

out1.txt (1-10)

1233945311
2344422211
2311122336

out2.txt (11-24)

12334455938383
17273747474747
47474838389292

out3.txt (25-32)

72664460
47472929
12103754

out4.txt (33-54)

0000111234499922281133
2921111098887777772235
9284753930837475111013

Any suggestions, please?

Best Answer

You can do this with Bash by using read and parameter expansion (substring expansion). The form is ${PARAMETER:OFFSET:LENGTH}, where OFFSET is zero-based. Save the following file as 'split', for example; it reads each line and writes the four slices to separate output files:

#!/usr/bin/env bash

# Usage: ./split "data.txt"

while IFS= read -r line
do
    printf '%s\n' "${line:0:10}"  >&3  #  1-10
    printf '%s\n' "${line:10:14}" >&4  # 11-24
    printf '%s\n' "${line:24:8}"  >&5  # 25-32
    printf '%s\n' "${line:32:22}" >&6  # 33-54
done < "$1" 3> output01.txt 4> output02.txt 5> output03.txt 6> output04.txt

# end file
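
As a quick check against the question's input.txt (an illustration added here, not part of the original answer; the output file names come from the script above):

$ chmod +x split
$ ./split input.txt
$ head -n 1 output01.txt output02.txt output03.txt output04.txt
==> output01.txt <==
1233945311

==> output02.txt <==
12334455938383

==> output03.txt <==
72664460

==> output04.txt <==
0000111234499922281133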

Of course you may need to adjust the positions slightly, but you can use this model for many different types of file processing. The positions above produce the desired output. A good reference on parameter expansion can be found at bash-hackers.org


As a postscript, after incorporating recommended practice improvements (see comments), keep in mind that for large files the Bash approach will not be efficient in terms of CPU time and resources. To quantify this statement, I prepared a brief comparison below. First, create a test file (bigsplit.txt) consisting of the data in the opening post repeated to a length of 300,000 lines (16,500,000 bytes). Then compare split, cut and awk, where the cut and awk implementations are identical to Stéphane Chazelas's versions. The CPU time, in seconds, is the sum of the system and user CPU times, and the RAM figure is the maximum used.
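
One way to build such a test file (an assumption on my part; the exact command wasn't recorded) is to repeat the three sample lines 100,000 times, which at 55 bytes per line (54 digits plus a newline) gives the sizes shown below:

# repeat the 3-line sample 100,000 times -> 300,000 lines, 16,500,000 bytes
for i in $(seq 100000); do cat input.txt; done > bigsplit.txt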

$ stat -c %s bigsplit.txt && wc -l bigsplit.txt 
16500000
300000 bigsplit.txt
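
The cut and awk wrappers are not reproduced here; plausible sketches (my reconstruction of that approach, not necessarily Stéphane Chazelas's exact commands) could look like:

#!/usr/bin/env bash
# ./cut "data.txt" -- one cut pass per output file, selecting character ranges
cut -c 1-10  "$1" > output01.txt
cut -c 11-24 "$1" > output02.txt
cut -c 25-32 "$1" > output03.txt
cut -c 33-54 "$1" > output04.txt

#!/usr/bin/env bash
# ./awk "data.txt" -- a single awk pass writing all four files via substr()
awk '{ print substr($0,  1, 10) > "output01.txt"
       print substr($0, 11, 14) > "output02.txt"
       print substr($0, 25,  8) > "output03.txt"
       print substr($0, 33, 22) > "output04.txt" }' "$1"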

$ ./benchmark ./split bigsplit.txt 

CPU TIME AND RESOURCE USAGE OF './split bigsplit.txt'
VALUES ARE THE AVERAGE OF ( 10 ) TRIALS

CPU, sec :   88.41
CPU, pct :   99.00
RAM, kb  : 1494.40

$ ./benchmark ./cut bigsplit.txt 

CPU TIME AND RESOURCE USAGE OF './cut bigsplit.txt'
VALUES ARE THE AVERAGE OF ( 10 ) TRIALS

CPU, sec :    0.86
CPU, pct :   99.00
RAM, kb  :  683.60

$ ./benchmark ./awk bigsplit.txt

CPU TIME AND RESOURCE USAGE OF './awk bigsplit.txt'
VALUES ARE THE AVERAGE OF ( 10 ) TRIALS

CPU, sec :    1.19
CPU, pct :   99.00
RAM, kb  : 1215.60
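
The benchmark helper itself isn't shown; a minimal sketch of a script producing similar output (an assumption rather than the original helper, omitting the CPU-percentage line and assuming GNU time and bc are installed) might be:

#!/usr/bin/env bash
# Usage: ./benchmark <command> [args...]
# Averages user+sys CPU seconds and maximum resident set size over 10 trials.
trials=10
tmp=$(mktemp)
cpu_total=0
ram_total=0
for ((i = 1; i <= trials; i++)); do
    /usr/bin/time -f '%U %S %M' -o "$tmp" "$@" >/dev/null 2>&1
    read -r user sys ram < "$tmp"
    cpu_total=$(bc <<< "$cpu_total + $user + $sys")
    ram_total=$((ram_total + ram))
done
rm -f "$tmp"
printf "CPU TIME AND RESOURCE USAGE OF '%s'\n" "$*"
printf 'VALUES ARE THE AVERAGE OF ( %d ) TRIALS\n\n' "$trials"
printf 'CPU, sec : %8.2f\n' "$(bc -l <<< "$cpu_total / $trials")"
printf 'RAM, kb  : %8.2f\n' "$(bc -l <<< "$ram_total / $trials")"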

The comparison follows, where the best performer, cut, is assigned a value of 1:

                             RELATIVE PERFORMANCE 

                                    CPU Secs     RAM kb
                                    --------     ------
                    cut                    1          1
                    awk                  1.4        1.8
                    split (Bash)       102.8        2.2

No doubt that, in this case, cut is the tool to use for larger files. From rough, preliminary tests of the Bash split above, the while-read loop accounts for about 5 seconds of the CPU time, the parameter expansions account for about 8 seconds, and the rest can be attributed to the printf-to-file operations.
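
Those component figures can be approximated (an illustration, not the exact measurement used above) by timing stripped-down variants of the loop:

# just the read loop, doing no other work
time while IFS= read -r line; do :; done < bigsplit.txt

# the read loop plus the four substring expansions, but no printf writes
time while IFS= read -r line; do
    a=${line:0:10} b=${line:10:14} c=${line:24:8} d=${line:32:22}
done < bigsplit.txt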
