AWK – How to Sum a Column Over a Specified Number of Lines

awknumeric data

I've reviewed the "Similar questions", and none seem to solve my problem:

I have a large CSV input file; each line in the file is an x,y data point. Here are a few lines for illustration, but please note that in general the data are not monotonic:

1.904E-10,2.1501E+00  
3.904E-10,2.1827E+00  
5.904E-10,2.1106E+00  
7.904E-10,2.2311E+00  
9.904E-10,2.2569E+00  
1.1904E-09,2.3006E+00  

I need to create an output file that is smaller than the input file. The output file will contain no more than one line for every N lines in the input file. Each single line in the output file will be a x,y data point which is the average of the x,y values for N lines of the input file.

For example, if the total number of lines in the input file is 3,000, and N=3, the output file will contain no more than 1,000 lines. Using the data above to complete this example, the first 3 lines of data above would be replaced with a single line as follows:

x = (1.904E-10 + 3.904E-10 + 5.904E-10) / 3 = 3.904E-10

y = (2.1501E+00 + 2.1827E+00 + 2.1106E+00) / 3 = 2.1478E+00, or:

3.904E-10,2.1478E+00 

for one line of the output file.

I've fiddled with this for a while, but haven't gotten it right. This is what I've been working with, but I can't see how to iterate the NR value to work through the entire file:

awk -F ',' 'NR == 1, NR == 3 {sumx += $1; avgx = sumx / 3; sumy += $2; avgy = sumy / 3} END {print avgx, avgy}' CB07-Small.csv

To complicate this a bit more, I need to "thin" my output file still further:

If the value of avgy (as calculated above) is close to the last value of avgy in the output file, I will not add this as a new data point to the output file. Instead I will calculate the next avgx & avgy values from the next N lines of the input file. "Close" should be defined as a percentage of the last value of argy. For example:

if the current calculated value of avgy differs by less than 10% from the last value of avgy recorded in the output file, then do not write a new value to the output file.

see edit history

Best Answer

Here’s a generic variant:

BEGIN { OFS = FS = "," }

{
    for (i = 1; i <= NF; i++) sum[i] += $i
    count++
}

count % 3 == 0 {
    for (i = 1; i <= NF; i++) $i = sum[i] / count
    delete sum
    count = 0
    if ($NF >= 1.1 * last || $NF <= 0.9 * last) {
        print
        last = $NF
    }
}


END {
    if (count > 0) {
        for (i = 1; i <= NF; i++) $i = sum[i] / count
        if ($NF >= 1.1 * last || $NF <= 0.9 * last) print
    }
}

I’m assuming that left-overs should be handled in a similar fashion to blocks of N lines.

Related Question