Determine type of an awk variable

awk gawk variable

I am using the gawk version of awk. The gawk manual states that awk variables have "attributes", which determine how they are treated in various operations.

For example, a string that is of the form " +3.14" which is obtained by parsing the input has the STRNUM attribute, which makes it behave as a number in a comparison with a number, whereas the same string defined in an awk program does not have this attribute.

OTOH, a string like "3.14" apparently has the STRNUM attribute even if it was defined in the program, because the code x = "3.14" { print x == 3.14 } prints 1. If we define it as "+3.14" or " 3.14" instead, it does not have the STRNUM attribute, since x = "+3.14" { print x == 3.14 } or x = " 3.14" { print x == 3.14 } prints 0.

I think that such subtlety in variable typing may cause hard-to-find bugs. Hence, to aid in debugging such situations, is there a way to learn which "attributes" a variable has? That is, can we find out the type of a variable?

Best Answer

Awk has four types: "number", "string", "numeric string" (strnum) and "undefined". Here is a function that detects them:

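# Classify obj as "number", "string", "strnum" or "undefined".
# q, x and z are locals (extra parameters, by awk convention).
function o_class(obj,   q, x, z) {
  q = CONVFMT                         # save the current CONVFMT
  CONVFMT = "% g"                     # non-integers now convert with a leading space or sign
    split(" " obj "\1" obj, x, "\1")  # x[1] = " " obj, x[2] = obj, both as split() output
    x[1] = obj == x[1]                # can be 1 only if the comparison is numeric
    x[2] = obj == x[2]
    x[3] = obj == 0
    x[4] = obj "" == +obj             # string form of obj vs. its value converted with CONVFMT
  CONVFMT = q                         # restore CONVFMT
  # map the four probe results to a class
  z["0001"] = z["1101"] = z["1111"] = "number"
  z["0100"] = z["0101"] = z["0111"] = "string"
  z["1100"] = z["1110"] = "strnum"
  z["0110"] = "undefined"
  return z[x[1] x[2] x[3] x[4]]
}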

The third argument of split must be a separator that is not a space and does not occur in obj; otherwise obj itself would be split at that character. I chose \1 based on Stéphane's suggestion. The function saves, changes and restores CONVFMT internally, so it should return the correct result regardless of the value of CONVFMT at the time of the call:

split("12345.6", q); print 1, o_class(q[1])
CONVFMT = "%.5g"; split("12345.6", q); print 2, o_class(q[1])
split("nan", q); print 3, o_class(q[1])
CONVFMT = "%.6G"; split("nan", q); print 4, o_class(q[1])

Result:

1 strnum
2 strnum
3 strnum
4 strnum
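The reason CONVFMT has to be saved and restored is that it controls how a non-integer number is turned into a string, and the function relies on that conversion. A minimal sketch of the effect in isolation, assuming a POSIX awk is on the PATH (the wrapper name convfmt_demo is only for illustration):

```shell
convfmt_demo() {
  awk 'BEGIN {
    x = 12345.6789
    i = 7
    CONVFMT = "%.6g"; print (x "")   # 12345.7  (the default format)
    CONVFMT = "%.2f"; print (x "")   # 12345.68 (same number, different string)
    print (i "")                     # 7        (integer values ignore CONVFMT)
  }'
}
convfmt_demo
```

Because the same number can produce different strings, any test that compares a value with its own string form has to pin CONVFMT first.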

Full test suite:

print 1, o_class(0)
print 2, o_class(1)
print 3, o_class(123456.7)
print 4, o_class(1234567.8)
print 5, o_class(+"inf")
print 6, o_class(+"nan")
print 7, o_class("")
print 8, o_class("0")
print 9, o_class("1")
print 10, o_class("inf")
print 11, o_class("nan")
split("00", q); print 12, o_class(q[1])
split("01", q); print 13, o_class(q[1])
split("nan", q); print 14, o_class(q[1])
split("12345.6", q); print 15, o_class(q[1])
print 16, o_class()

Result:

1 number
2 number
3 number
4 number
5 number
6 number
7 string
8 string
9 string
10 string
11 string
12 strnum
13 strnum
14 strnum
15 strnum
16 undefined

The notable weakness: if you pass a numeric string whose value is any of the following, the function incorrectly returns "number":

  • integer
  • inf
  • -inf

For integers, the POSIX specification explains this:

A numeric value that is exactly equal to the value of an integer shall be converted to a string by the equivalent of a call to the sprintf function with the string %d as the fmt argument

However, inf and -inf behave the same way; that is, the string conversion of integers, inf and -inf is not influenced by CONVFMT:

CONVFMT = "% g"
print "" .1
print "" (+"nan")
print "" 1
print "" (+"inf")
print "" (+"-inf")

Result:

 0.1
 nan
1
inf
-inf
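The probe that separates numbers from numeric strings (x[4] in the function) can be run on its own to see why integers defeat it. This is a minimal sketch, assuming a POSIX awk on the PATH; the wrapper name probe_demo and the variable names are only for illustration:

```shell
probe_demo() {
  # Input fields that look like numbers are numeric strings (strnum).
  printf '1 1.5\n' | awk '{
    CONVFMT = "% g"
    n = 1; f = 1.5
    # A string compared with a number is a string comparison,
    # with the number converted through CONVFMT.
    print ($1 "" == +$1)   # 1: strnum "1",   "1"    vs "1"
    print (n "" == +n)     # 1: number 1,     "1"    vs "1"
    print ($2 "" == +$2)   # 0: strnum "1.5", "1.5"  vs " 1.5"
    print (f "" == +f)     # 1: number 1.5,   " 1.5" vs " 1.5"
  }'
}
probe_demo
```

For 1.5 the leading space added by "% g" tells the two cases apart; for 1 both sides are "1" either way, so the probe cannot distinguish them.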

In practice this rarely matters (see the duck test): a numeric string that the function mistakes for a number also compares and prints like that number.
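Since the question is about gawk specifically: gawk 4.2 and later have a built-in typeof() function that reports the type directly. It is a gawk extension, so it is not available in other awks. A minimal sketch, assuming such a gawk is installed (the wrapper name typeof_demo is only for illustration):

```shell
typeof_demo() {
  printf '3.14 abc\n' | gawk '{
    n = 3.14; s = "3.14"
    print typeof(n)    # number
    print typeof(s)    # string
    print typeof($1)   # strnum (input that looks like a number)
    print typeof($2)   # string (input that does not)
  }'
}

if command -v gawk >/dev/null 2>&1; then
  typeof_demo
else
  echo "gawk is not installed; typeof() is a gawk extension" >&2
fi
```

According to the gawk manual, typeof() can also return "array", "regexp", and "untyped" or "unassigned" for variables that have not been given a value.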
