AWK – How to Sum a Column Over a Specified Number of Lines

awk, numeric data

I've reviewed the "Similar questions", and none seem to solve my problem:

I have a large CSV input file; each line in the file is an x,y data point. Here are a few lines for illustration, but please note that in general the data are not monotonic:

1.904E-10,2.1501E+00  
3.904E-10,2.1827E+00  
5.904E-10,2.1106E+00  
7.904E-10,2.2311E+00  
9.904E-10,2.2569E+00  
1.1904E-09,2.3006E+00

I need to create an output file that is smaller than the input file. The output file will contain no more than one line for every N lines in the input file. Each line in the output file will be an x,y data point that is the average of the x,y values for N lines of the input file.

For example, if the total number of lines in the input file is 3,000, and N=3, the output file will contain no more than 1,000 lines. Using the data above to complete this example, the first 3 lines of data above would be replaced with a single line as follows:

x = (1.904E-10 + 3.904E-10 + 5.904E-10) / 3 = 3.904E-10

y = (2.1501E+00 + 2.1827E+00 + 2.1106E+00) / 3 = 2.1478E+00, or:

3.904E-10,2.1478E+00

for one line of the output file.
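
The arithmetic in this example can be checked with a short awk sketch. It only averages whatever lines it is given, so it is a check of the numbers above rather than a solution to the full problem. The function name `block_avg` and the output precision are assumptions chosen to match the formatting of the example.

```shell
# Hypothetical helper: average all x,y lines read from standard input.
block_avg() {
    awk -F, '{ sx += $1; sy += $2 }
             END { if (NR) printf "%.3E,%.4E\n", sx / NR, sy / NR }'
}

# The first three sample lines from the question.
printf '%s\n' 1.904E-10,2.1501E+00 3.904E-10,2.1827E+00 5.904E-10,2.1106E+00 | block_avg
# prints 3.904E-10,2.1478E+00
```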

I've fiddled with this for a while, but haven't gotten it right. This is what I've been working with, but I can't see how to iterate the NR value to work through the entire file:

awk -F ',' 'NR == 1, NR == 3 {sumx += $1; avgx = sumx / 3; sumy += $2; avgy = sumy / 3} END {print avgx, avgy}' CB07-Small.csv

To complicate this a bit more, I need to "thin" my output file still further:

If the value of avgy (as calculated above) is close to the last value of avgy in the output file, I will not add this as a new data point to the output file. Instead I will calculate the next avgx & avgy values from the next N lines of the input file. "Close" should be defined as a percentage of the last value of avgy. For example:

If the current calculated value of avgy differs by less than 10% from the last value of avgy recorded in the output file, then do not write a new value to the output file.


Best Answer

Here’s a generic variant:

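BEGIN { OFS = FS = "," }

# Accumulate per-column sums for the current block.
{
    for (i = 1; i <= NF; i++) sum[i] += $i
    count++
}

# After every third line, replace the fields with the block averages.
count % 3 == 0 {
    for (i = 1; i <= NF; i++) $i = sum[i] / count
    delete sum
    count = 0
    # Print only if the last column differs from the last printed value
    # by 10% or more (this comparison assumes positive values).
    if ($NF >= 1.1 * last || $NF <= 0.9 * last) {
        print
        last = $NF
    }
}

# Handle a final, partial block the same way.
END {
    if (count > 0) {
        for (i = 1; i <= NF; i++) $i = sum[i] / count
        if ($NF >= 1.1 * last || $NF <= 0.9 * last) print
    }
}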

I’m assuming that left-overs should be handled in a similar fashion to blocks of N lines.
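
If the block size and the threshold need to vary, one possible sketch passes them in as awk variables. This is not the code above: the names `thin_avg`, `n`, `pct` and `emit_block` are hypothetical, and the closeness test is swapped for an absolute difference so that negative values are handled too. The first block is always printed.

```shell
# Sketch: average every n lines of a CSV, then drop averages whose last
# column is within pct percent of the last value printed.
# Usage: thin_avg N PCT < input.csv   (all names here are assumptions)
thin_avg() {
    awk -v n="$1" -v pct="$2" '
    BEGIN { FS = OFS = "," }

    # Print the average of the current block unless it is too close to
    # the last printed value, then reset the accumulators.
    function emit_block(    i, avg, diff, ref, line) {
        avg  = sum[nf] / count
        diff = avg - last
        if (diff < 0) diff = -diff
        ref  = (last < 0 ? -last : last) * pct / 100
        if (!printed || diff >= ref) {
            line = sum[1] / count
            for (i = 2; i <= nf; i++) line = line OFS (sum[i] / count)
            print line
            last = avg
            printed = 1
        }
        for (i = 1; i <= nf; i++) sum[i] = 0
        count = 0
    }

    {
        nf = NF
        for (i = 1; i <= NF; i++) sum[i] += $i
        if (++count == n) emit_block()
    }

    # Left-over lines form a final, shorter block.
    END { if (count > 0) emit_block() }
    '
}
```

For example, `thin_avg 3 10 < CB07-Small.csv` would average blocks of three lines and apply the 10% rule from the question. Values are printed with awk's default number formatting, so a `printf` would be needed to reproduce the scientific notation of the input.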

Related Solutions

Sum of alternate values in a column using either sed or nawk

How it works

-v OFS='\t'

Optional: this sets the output field separator to a tab, so the output is tab-separated.

{for (i=1;i<=NF;i++) {s[2-NR%2,i]+=$i;s[3,i]+=$i;}; $1=$1; print}

This loops through each column, adding its values to the array s. For each column i, even-numbered rows are added to s[2,i] while odd-numbered rows are added to s[1,i]. Column i on all rows is added to s[3,i].

Assigning $1=$1 rebuilds the row using the output field separator, and the row is then printed.

END{for (n=1;n<=3;n++) print s[n,1],s[n,2],s[n,3]}

After we have reached the end of the file, the results are printed, first for the odd-numbered lines (n=1), then the even-numbered lines (n=2), then the total (n=3).
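
To see the pieces working together, here is the same program laid out over several lines and run on a small made-up input. The function name `sum_alternate` and the sample numbers are assumptions for illustration; like the original, the END block assumes exactly three columns.

```shell
# Sketch: echo each row tab-separated, then print the odd-row sums,
# the even-row sums and the overall sums for three columns.
sum_alternate() {
    awk -v OFS='\t' '
    {
        for (i = 1; i <= NF; i++) {
            s[2 - NR % 2, i] += $i   # 1 = odd-numbered rows, 2 = even-numbered rows
            s[3, i] += $i            # 3 = all rows
        }
        $1 = $1                      # rebuild the record so OFS is applied
        print
    }
    END { for (n = 1; n <= 3; n++) print s[n, 1], s[n, 2], s[n, 3] }'
}

# Made-up sample data: four rows of three columns.
printf '1 2 3\n4 5 6\n7 8 9\n10 11 12\n' | sum_alternate
```

With this input the last three lines of output are the odd-row sums (8, 10, 12), the even-row sums (14, 16, 18) and the totals (22, 26, 30), each tab-separated.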

Sun/Solaris

I have had multiple reports that the default awk on Sun/Solaris has issues. Please try:

nawk -v OFS='\t' '{for (i=1;i<=NF;i++) {s[2-NR%2,i]+=$i;s[3,i]+=$i;};$1=$1;print} END{for (n=1;n<=3;n++) print s[n,1],s[n,2],s[n,3]}' foo.txt

Or:

/usr/xpg4/bin/awk -v OFS='\t' '{for (i=1;i<=NF;i++) {s[2-NR%2,i]+=$i;s[3,i]+=$i;};$1=$1;print} END{for (n=1;n<=3;n++) print s[n,1],s[n,2],s[n,3]}' foo.txt

Or:

/usr/xpg6/bin/awk -v OFS='\t' '{for (i=1;i<=NF;i++) {s[2-NR%2,i]+=$i;s[3,i]+=$i;};$1=$1;print} END{for (n=1;n<=3;n++) print s[n,1],s[n,2],s[n,3]}' foo.txt

Column manipulation using AWK

Perl is very concise for this: split each line into words, pop off the last word, and insert it at index 3 (0-based).

$ perl -lane 'splice @F, 3, 0, pop(@F); print "@F"' file | column -t
chr10  181243  225933  36  1  1  1  10   0
chr10  181500  225933  35  1  1  1  106  0
...
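
The same move can be sketched in awk by shifting fields to the right and writing the saved last field into position 4 (awk fields are 1-based, so Perl's index 3 is `$4`). The function name `move_last_to_fourth` and the sample line are hypothetical; rows with four or fewer fields are left unchanged.

```shell
# Sketch: move the last field of each line so that it becomes field 4.
move_last_to_fourth() {
    awk '{
        if (NF > 4) {
            last = $NF
            # Shift fields 4..NF-1 one place to the right, starting from the end.
            for (i = NF; i > 4; i--) $i = $(i - 1)
            $4 = last
        }
        print
    }'
}

printf 'a b c d e f\n' | move_last_to_fourth
# prints: a b c f d e
```

As with the Perl version, the result can be piped through `column -t` to align the columns.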
