awk Text Processing – Handling ‘inf’ and ‘-inf’ Differently

Given a single-column file of numbers, call it f, the following awk code returns the maximum value:

cat f | awk '
    BEGIN { max = -inf }
          { if ($1 > max) max = $1 }
    END   { print max }
'

The same approach for the minimum prints nothing:

cat f | awk '
    BEGIN { min = inf }
          { if ($1 < min) min = $1 }
    END   { print min }
'

But if, instead of using inf, I start off with min = [some large number], the revised code works, provided the number is large enough (how large depends on what's in the file).

Why doesn't inf work, and is there some way to make the min case work like the max case, without having to know what's in the file?

Best Answer

The actual task is best solved by initializing your max/min values not with an imaginary "smallest" or "greatest" number (which may not be implemented in the framework you are using, in this case awk), but with actual data. That way, as long as the file is not empty, the result is a value that actually occurs in the data.

In your case, you can use the very first value you encounter (i.e. the entry in the first line) to initialize max and min, respectively, by adding a rule

NR==1{min=$1}

to your awk script. Then, if the first value is already the minimum, the subsequent test will not overwrite it, and in the end the correct result will be produced. The same holds for searches of the maximum value, so in combined searches, you can state

NR==1{max=min=$1}
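Put together with the comparisons from the question, a combined search might look like the following sketch. The function name minmax is only a label I picked for illustration, and the script assumes that every line of the input holds a number in its first column (blank or non-numeric lines would need extra handling).

```shell
# minmax: print "min max" of the first column read from stdin.
# (minmax is a hypothetical helper name, used here for illustration only.)
minmax() {
    awk '
        NR == 1  { min = max = $1 }   # seed both values from the first record
        $1 < min { min = $1 }         # a smaller value replaces min
        $1 > max { max = $1 }         # a larger value replaces max
        END      { if (NR > 0) print min, max }   # print nothing for empty input
    '
}

printf '3\n-7\n12\n' | minmax
```

With the file from the question, the call would be minmax < f. Because both variables start from real data, no assumption about the sign or magnitude of the values is needed.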

As for the reason why your approach with inf didn't work with awk whereas -inf seemed to, @steeldriver has provided a good explanation in a comment to your question, which I will also summarize for the sake of completeness:

  • In awk, variables are "dynamically typed", i.e. a value can act as a string or a number depending on use (awk keeps track of whether a value is a string, a number, or input that merely looks like a number).
  • Whenever a variable appears in an arithmetic operation, awk tries to interpret its content as a number and performs the operation; the result of that operation is then a number.
  • The default value for any variable that has not yet been assigned anything is the empty string, which is interpreted as 0 in arithmetic operations.
  • The variable name(*) inf has no special meaning in awk, so when used on its own it is just an unset variable that evaluates to 0 in an arithmetic expression such as -inf. Therefore, the "maximum search" with the max variable initialized to -inf works if your data is all positive, because -inf is simply 0 (and as such, the smallest non-negative number).
  • In the "minimum search" problem, however, initializing min to inf will initialize the variable to the empty string, as no arithmetic operation is present that would warrant an automatic conversion of that empty string to a number.
  • Therefore, in the later comparisons

    if ($1<min) min=$1
    

    the input, $1, is compared with a string value, which is why awk treats $1 as a string, too, and performs a lexicographical comparison rather than a numerical one.

  • However, lexicographically, nothing is "smaller" than the empty string, and so min never gets assigned a new value. Therefore, in the END section, the statement

    print min
    

    prints the (still) empty string.

(*) see Stephen Kitt's answer on how a string with content "inf" can actually have a meaning in awk.
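The two building blocks of this explanation can be checked directly. The sketch below only assumes that inf has never been assigned; probe is a hypothetical helper name chosen for illustration.

```shell
# probe: print three lines showing how awk treats an unset variable and
# how string and numeric comparison differ.
# (probe is a hypothetical helper name; inf is used here only as an
# ordinary, never-assigned variable.)
probe() {
    awk 'BEGIN {
        # Unset variable: empty as a string, 0 as a number.
        print length(inf), inf + 0

        # It compares equal to both the empty string and zero.
        a = (inf == ""); b = (inf == 0)
        print a, b

        # Strings compare lexicographically, numbers numerically.
        c = ("10" < "9"); d = (10 < 9)
        print c, d
    }'
}

probe
```

The first line of output is 0 0 (empty string, numeric value zero), the second is 1 1, and the third is 1 0: as strings, "10" sorts before "9", which is the kind of surprise a lexicographical comparison can produce.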
