Awk – How to Use Shell Variable Inside a Pattern

awk

I have a text file with tab-separated columns which I'd like to process using awk.

Here's an example of such a file:

size=1\tname=foo\tweight=1.2
weight=2.5\tname=bar\tsize=2

What I want to achieve is to normalize the numeric value in columns whose content is like $field_name=<number> to four decimal places and keep the rest as is. Here, $field_name is a shell variable that is passed to awk and I'd like to use its value inside a regex.

Here's a snippet (which is not working of course). I'm particularly interested in fixing line #5 in the following awk script and not solutions using other tools, such as sed, perl, python, etc.

$ cat "${file}" \                                       # 1
    | awk -F "\t" -v field_name="${external_var}" '     # 2
      {                                                 # 3
        for (i = 1; i <= NF; i++) {                     # 4
          if ($i ~ /$field_name=[0-9]*.?[0-9]+/) {      # 5
            split($i, kv, "=")                          # 6
            $i = sprintf("%s=%.4f", kv[1], kv[2])       # 7
          }                                             # 8
        }                                               # 9
        print $0                                        # 10
      }'

Best Answer

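Inside a /.../ regexp literal, awk variables are not expanded: $field_name there is the end-of-string anchor $ followed by the literal text field_name. Build the regexp as a string by concatenation instead, so line #5 should be: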

if ($i ~ field_name "=[0-9]*.?[0-9]+") ...

Or:

 regexp = field_name "=[0-9]*.?[0-9]+"
 if ($i ~ regexp) ...
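To see the difference between the two forms, here is a minimal sketch. The sample input size=1 and the shortened number pattern are my own choices for illustration, not taken from the question.

```shell
out=$(
  printf 'size=1\n' | awk -v field_name=size '{
    # Regexp literal: "$" is an anchor and "field_name" is literal text, so this never matches.
    print "regexp literal:", (($0 ~ /$field_name=[0-9]+/) ? "match" : "no match")
    # Dynamic regexp: the string is built by concatenation, then used as a regexp.
    print "dynamic regexp:", (($0 ~ (field_name "=[0-9]+")) ? "match" : "no match")
  }'
)
printf '%s\n' "$out"
```

The parentheses around the concatenation are optional (concatenation binds tighter than ~), but they make the intent easier to read.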

Note that . matches any single character. If you want to match a literal ., you'd need regexp to contain \. (which inside double quotes would have to be written \\.) or [.].

 regexp = field_name "=[0-9]*\\.?[0-9]+"
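As a rough illustration of what the escaping changes, the sketch below prints the portion of a made-up field, size=1x5, that each variant matches. The field name is hardcoded here to keep the example short.

```shell
out=$(
  printf 'size=1x5\n' | awk '{
    unescaped = "size=[0-9]*.?[0-9]+"      # here . matches any character, including x
    escaped   = "size=[0-9]*\\.?[0-9]+"    # \\. in a string yields \. which is a literal dot
    # match() sets RSTART and RLENGTH to the position and length of the matched text.
    if (match($0, unescaped)) print "unescaped:", substr($0, RSTART, RLENGTH)
    if (match($0, escaped))   print "escaped:",   substr($0, RSTART, RLENGTH)
  }'
)
printf '%s\n' "$out"
```

The unescaped variant swallows the whole of size=1x5, while the escaped one stops at size=1.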

I'd also expect you'd want to anchor the regexp:

 regexp = "^" field_name "=[0-9]*\\.?[0-9]+$"
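Putting the pieces together, a sketch of the whole script could look like the following. The temporary file and the value size for external_var are assumptions made so that the example is self-contained. I have also added -v OFS='\t', which is not in the question: assigning to $i makes awk rebuild $0 with OFS, which defaults to a single space, so without it the modified lines would no longer be tab-separated.

```shell
# Sample input from the question, written to a temporary file (assumed location).
file=$(mktemp)
printf 'size=1\tname=foo\tweight=1.2\nweight=2.5\tname=bar\tsize=2\n' > "$file"

external_var=size   # assumed value for illustration

result=$(
  awk -F '\t' -v OFS='\t' -v field_name="$external_var" '
    BEGIN {
      # Build the anchored regexp once, by string concatenation.
      regexp = "^" field_name "=[0-9]*\\.?[0-9]+$"
    }
    {
      for (i = 1; i <= NF; i++) {
        if ($i ~ regexp) {
          split($i, kv, "=")
          $i = sprintf("%s=%.4f", kv[1], kv[2])
        }
      }
      print
    }' < "$file"
)
printf '%s\n' "$result"
rm -f "$file"
```

With the sample input, only the size columns are rewritten (to size=1.0000 and size=2.0000); the weight and name columns are left as they were.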

Other notes:

  • cat "${file}" is a UUOC (useless use of cat), which also has the drawbacks, compared with a redirection such as awk ... <"$file", that it doesn't work when $file starts with - and that awk still runs if the file can't be opened.
  • -v field_name="$external_var" mangles backslashes, because awk processes escape sequences in the assigned value (\t becomes a TAB, \\ a single backslash). Another approach that doesn't have the problem is to use an environment variable: FIELD="$external_var" awk ... and refer to it within awk as ENVIRON["FIELD"].
  • as the contents of field_name are copied verbatim into regexp, they are treated as a regexp, so if $external_var contains regexp operators (.+*?{}()[]\^$...), it may not work properly.
  • in some locales and awk implementations, [0-9] matches a lot more characters than just 0123456789 (though I suspect it would be (non-ASCII) characters unlikely to occur in your input).
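One way to sidestep both the backslash and the regexp-operator problems within awk is sketched below. It passes the name through the environment and compares it as a plain string with substr(), so that only the numeric part is matched against a regexp. The field name a.b and the sample line are invented for the example, and this is just one possible approach, not the only one.

```shell
file=$(mktemp)   # assumed temporary input file
printf 'a.b=1\tname=foo\taxb=2\n' > "$file"

result=$(
  FIELD='a.b' awk -F '\t' -v OFS='\t' '
    BEGIN {
      prefix = ENVIRON["FIELD"] "="   # taken verbatim, no escape processing
      plen = length(prefix)
    }
    {
      for (i = 1; i <= NF; i++) {
        value = substr($i, plen + 1)
        # Literal string comparison on the name, regexp only on the number.
        if (substr($i, 1, plen) == prefix && value ~ /^[0-9]*\.?[0-9]+$/) {
          $i = sprintf("%s%.4f", prefix, value)
        }
      }
      print
    }' < "$file"
)
printf '%s\n' "$result"
rm -f "$file"
```

Here a.b=1 is normalized to a.b=1.0000, while axb=2 is left alone; with field_name spliced into a regexp, the . in a.b would have matched the x as well.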

With perl:

FIELD=size <"$file" perl -lpe '
  s{
    (?<![^\t])       # not-preceded by a non-TAB
    \Q$ENV{FIELD}=\E # contents of $FIELD taken literally
    \K               # matched portion starts here
    \d*\.?\d+
    (?![^\t])        # not followed by a non-TAB
  }{
    sprintf "%.4f", $&
  }gxe'

That would not have any of the issues discussed above. It would also work better than awk with input that contains invalid text, such as a mix of text and binary data, or text encoded in a charset different from that of the user's locale.
