Bash – Script that Outputs the Average of Multiple Group lines, from a File

awkbashlinuxsedshell

This is a "Small" portion of the file car_sales.txt that I am trying to extract information from (the full file has about 700 lines, with more vehicle brands than are listed here):

first_name,last_name,price_paid,brand,year
Mann,Mathers,20500.79,Chevy,2012
Doug,Samual,21000.12,Dodge,2015
Walter,Gray,17000.87,Dodge,2010
Jessica,Garnet,17350.00,MINI,2009
Paula,Raymond,45300.87,BMW,2015
Willie,Reynolds,64950.05,BMW,2015
Sam,Collins,70200.35,Lexus,2014
Katy,Martinez,29580.84,Chevy,2012
Nicole,Davis,31650.60,Chevy,2009
Brenda,Gray,12400.56,Dodge,2012
Samantha,Fernandez,27900.21,MINI,2015
Eric,Woods,68900.85,BMW,2009
George,Luke,33453.91,BMW,2011
Mildred,Takey,46820.80,Lexus,2012

I would like to output the "brand" and "price_paid" columns ( and find the average price paid per brand, for all the cars), sorted (a-z), and first line the "header" removed.
This is the output I am looking for (from the example listed above):

BMW,53151.4
Chevy,27244.1
Dodge,16800.5
Lexus,58510.6
MINI,22625.1

Now I have been working on this trying to figure this out with no luck for 2 days (I am new at this), best I could come up with is this:

sed '1d' car_sales.txt |awk -F ',' '/Chevy/{print $3}' $1|awk '{total += $1; count ++}END{print "Chevy," total/count}'

Now clearly that is "Not" what I am looking for; If I needed the average output for just "one single" Brand/Price_Paid, then yeah it would work, I would just input the single "Pattern" I'm looking for and then I would get the Average price paid.

However I am looking for a way to catch and output the Average Price paid for ALL the brands in the car_sales.txt file. And there are plenty of more brands then just the 5 in the portion I listed (about 50+ brands).

I have read my 3 books that I have and scanned for hours online, and for the life of me I can't figure it out. Maybe I'm not even looking in the right place, I thought awk would be the answer but it so massive. Many thanks for the help in advance.

I then got an Idea thinking that I had figured a way to make it happen, and starting writing this script. Logically it seemed like it would work in my head, the thought was I would use the output from the first function as the output to the 2nd. Alas this didn't work either, I thought I was on the right track but no.

#!/bin/bash

#This will output the car "brand"
function brand {
        sed '1d' $1| cut -d ',' -f 4 |sort|uniq 
}

#The output of function "brand", will be the pattern for function "average"
function average {
    awk -F ',' '/'"$names"'/{print $3}' $1|awk '{total += $1; count ++}END{print "'$names'" "," total/count}'

}

brand $1
names=$(brand)
average $1 $names

Best Answer

Since awk arrays are indexed by strings, you can use one array to keep the total price for that brand so far, and use another array to keep the count of records seen for that brand.

Because "brand" is field 4, you can index the arrays in awk like this:

total_price[$4] += $3        # accumulate total price for this brand
count[$4] += 1               # increment count of records for this brand

At the end, loop through the keys to the arrays, and format the output while calculating the averages.

Since POSIX awk contains no sort function, pipe the output of the awk command to the standard Unix sort command.

Please try this:

Script

#!/bin/sh

#first_name,last_name,price_paid,brand,year
#print for each brand, the average price paid

awk -F, '
    NR == 1 {
        next                        # skip header
    }
    {
        price_paid[$4] += $3        # accumulate total price for this brand       
        count[$4] += 1              # increment count of records for this brand
    }
    END {
        for (brand in price_paid) {
            printf "%s,%7.2f\n", brand, price_paid[brand] / count[brand]
        }
    }
' < "${1:?filename required}" | sort

Annotation/Explananation

  1. Invoke the awk command, setting the Field Separator to comma (,) and passing everything between the single quote on this line and the next single quote several lines below, as the script:

    awk -F, '
    
  2. Skip Header: If the current record number is 1, then skip all processing on the current line (the first line), and get the next line of input:

        NR == 1 {
            next                        # skip header
        }
    
  3. Accumulate Price Total Per Brand (this is executed on every line):
    The arrays price_paid and count are indexed by the brand string.
    Add the current price paid ($3) to the price_paid total for this brand.
    Increment the count of records for this brand:

        {
            price_paid[$4] += $3        # accumulate total price for this brand    
            count[$4] += 1              # increment count of records for this brand
        }
    
  4. Print the Output Table: After all input is processed, step through the keys (brand) to the price_paid array, and for each brand, print the brand and the average of price_paid for that brand:

        END {
            for (brand in price_paid) {
                printf "%s,%7.2f\n", brand, price_paid[brand] / count[brand]
            }
       }
    
  5. Terminate the script argument, redirect input from the filename parameter, and pipe the output of the awk command to the sort command:

    ' < "${1:?filename required}" | sort
    

The single quote (') terminates the script argument to awk.
< "${1:?filename required}" redirects the standard input of awk from the filename specified by the first command line parameter to the script. If there is no parameter, then the shell will print an error message containing "filename required" and exit with error status.

Related Question