Why does gawk treat `0123` as a decimal number when it comes from the input data?

gawk, input, numeric-data

According to $ man gawk, the strtonum() function can convert a string into a number:

strtonum(str)
    Examine str, and return its numeric value. If str begins with a
    leading 0, treat it as an octal number. If str begins with a
    leading 0x or 0X, treat it as a hexadecimal number. Otherwise,
    assume it is a decimal number.

So a string with a leading 0 should be treated as octal, and a string with a leading 0x as hexadecimal.

I've run these commands to check my understanding of the function:

$ awk 'END { print strtonum("0123") }' <<<''
83

$ awk 'END { print strtonum("0x123") }' <<<''
291

The string "0123" is correctly treated as containing an octal number and converted into the decimal number 83.
Similarly, the string "0x123" is correctly treated as containing a hexadecimal number and converted into the decimal number 291.

Now, here's what happens if I run the same commands, but moving the numerical strings from the program text to the input data:

$ awk 'END { print strtonum($1) }' <<<'0123'
123

$ awk 'END { print strtonum($1) }' <<<'0x123'
291

I understand the second result, which is identical to the one from the previous commands, but I don't understand the first one. Why does gawk now treat 0123 as a decimal number, even though it begins with a leading 0, which characterizes octal numbers?

I suspect it has something to do with the strnum attribute, because for some reason [1], gawk gives this attribute to 0123 but not to 0x123:

$ awk 'END { print typeof($1) }' <<<'0123'
strnum

$ awk 'END { print typeof($1) }' <<<'0x123'
string

[1] It may be due to a variation between awk implementations:

To clarify, only strings that are coming from a few sources (here quoting the
POSIX spec): […] are to be considered a numeric string if their value happens
to be numerical (allowing leading and trailing blanks, with variations between
implementations in support for hex, octal, inf, nan…).


I'm using gawk version 4.2.62, and the output of $ awk -V is:

GNU Awk 4.2.62, API: 2.0 (GNU MPFR 3.1.4, GNU MP 6.1.0)

Best Answer

This is related to the generalised strnum handling in version 4.2 of GAWK.

Input values which look like numbers are treated as strnum values, represented internally as having both string and number types. “0123” qualifies as looking like a number, so it is handled as a strnum. strtonum is designed to handle both string and number inputs; it looks for numbers first, and when it encounters an input number, returns the number without transformation:

NODE *
do_strtonum(int nargs)
{
        NODE *tmp;
        AWKNUM d;

        tmp = fixtype(POP_SCALAR());
        if ((tmp->flags & NUMBER) != 0)
                d = (AWKNUM) tmp->numbr;
        else if (get_numbase(tmp->stptr, tmp->stlen, use_lc_numeric) != 10)
                d = nondec2awknum(tmp->stptr, tmp->stlen, NULL);
        else
                d = (AWKNUM) force_number(tmp)->numbr;

        DEREF(tmp);
        return make_number((AWKNUM) d);
}

Thus “0123” becomes the number 123, and strtonum returns that directly.

“0x123” doesn’t look like a number (by the rules defined in the link given above), so it is handled as a string and processed as you’d expect by strtonum.

A number is defined as follows in AWK:

The input string is decomposed into two parts: an initial, possibly empty, sequence of white-space characters (as specified by isspace()) and a subject sequence interpreted as a floating-point constant.

The expected form of the subject sequence is an optional '+' or '-' sign, then a non-empty sequence of digits optionally containing a <period>, then an optional exponent part. An exponent part consists of 'e' or 'E', followed by an optional sign, followed by one or more decimal digits.

The sequence starting with the first digit or the <period> (whichever occurs first) is interpreted as a floating constant of the C language, and if neither an exponent part nor a <period> appears, a <period> is assumed to follow the last digit in the string. If the subject sequence begins with a <hyphen-minus>, the value resulting from the conversion is negated.
