AWK – Process Last Line First Using AWK

awk, text processing

I have a data file that I want to normalize with awk, based on the last data point. Therefore, I would like to access the last data point first to normalize the data, then process the file normally.

The following method, which uses tac twice, does the job but is perhaps more complicated than necessary.

$ cat file
0 5
1 2
2 3
3 4
$ tac file | awk 'NR==1{norm=$2} {print $1, $2/norm}' | tac
0 1.25
1 0.5
2 0.75
3 1

My question is the following: Is it possible to obtain the above result by using awk only?

I think the answer is "No, awk scans the file line by line", but I am open to suggestions for alternatives.

Best Answer

You can do it as a two-pass solution in awk:

awk 'FNR == NR { n = $2; next } { print $1, $2/n }' infile infile

If your version of awk supports the ENDFILE block (e.g. GNU awk 4+), you can do it like this:

awk 'ENDFILE { n = $2 } FNR != NR { print $1, $2/n }' infile infile

Note that it is more efficient to seek to the end of the file first; see camh's answer.
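One way to read the last line first is sketched below. It is an assumption about the general idea, not a quotation of camh's answer: tail -n 1 fetches the final line without awk having to read the whole file, and the value is handed to awk with -v. The names data, norm and result are illustrative, and the sample data is recreated in a temporary file.

```shell
# Recreate the sample data from the question in a temporary file.
data=$(mktemp)
printf '%s\n' '0 5' '1 2' '2 3' '3 4' > "$data"

# tail fetches only the final line; the inner awk extracts its second field.
norm=$(tail -n 1 "$data" | awk '{ print $2 }')

# A single pass over the data, dividing by the value fetched above.
result=$(awk -v norm="$norm" '{ print $1, $2 / norm }' "$data")
printf '%s\n' "$result"

rm -f "$data"
```

This still starts awk twice, but the first invocation only ever sees one line.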

Explanation

The first example works by storing $2 from every line of the first pass in n, so that n holds the last line's value once that pass ends. The first block runs only while the per-file line counter (FNR) equals the global line counter (NR), which is true only for the first file argument. The next statement skips the remaining rules for that line, so the second block runs only while the second file argument is read.

The second example uses similar logic but takes advantage of the ENDFILE block, which is evaluated when the end of an input file is reached.
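To see how FNR and NR separate the two passes, the following sketch runs the first example with a trace line added to each block. The trace text and the variable names data and trace are additions for illustration only.

```shell
# Recreate the sample data from the question in a temporary file.
data=$(mktemp)
printf '%s\n' '0 5' '1 2' '2 3' '3 4' > "$data"

# Same logic as the two-pass answer, with NR and FNR printed for every record.
trace=$(awk '
  FNR == NR { n = $2; print "pass 1: NR=" NR, "FNR=" FNR, "n=" n; next }
            { print "pass 2: NR=" NR, "FNR=" FNR, "->", $1, $2 / n }
' "$data" "$data")
printf '%s\n' "$trace"

rm -f "$data"
```

During the first pass both counters run from 1 to 4 and n is overwritten on every line. During the second pass NR continues from 5 while FNR restarts at 1, so only the second block fires.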

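Related Solutions

AWK – Print Ranges from Last Occurrence of Pattern to First Occurrence of Another

How it works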

f{a=a"\n"$0}

If the variable f is true (nonzero), append the current line to the end of variable a.

/Condition 1/{a=$0; f=1}

If the current line contains Condition 1, set a to the current line and set variable f to 1. This discards anything collected since an earlier Condition 1.

f && /Condition 2/{print a; f=0}

If f is true and the current line contains Condition 2, print variable a and set f back to zero.
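Assembled into one command, the three rules might look like the sketch below. The original command line is not shown here, so the assembly, the sample input, and the names input and block are assumptions for illustration. Condition 1 and Condition 2 stand for whatever patterns mark the start and end of the range.

```shell
# Hypothetical input: two start markers before the first end marker.
input=$(mktemp)
printf '%s\n' 'x' 'Condition 1' 'a' 'Condition 1' 'b' 'Condition 2' 'c' 'Condition 2' > "$input"

block=$(awk '
  f { a = a "\n" $0 }                     # while collecting, append the current line
  /Condition 1/ { a = $0; f = 1 }         # restart the buffer at every start marker
  f && /Condition 2/ { print a; f = 0 }   # at the end marker, print and stop collecting
' "$input")
printf '%s\n' "$block"

rm -f "$input"
```

With this input only the lines Condition 1, b and Condition 2 are printed. The earlier Condition 1 and the line a are dropped when the buffer restarts, and the second Condition 2 is ignored because f is already zero.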

Awk – How to Use Shell Variable Inside a Pattern

That should be:

if ($i ~ field_name "=[0-9]*.?[0-9]+") ...

Or:

 regexp = field_name "=[0-9]*.?[0-9]+"
 if ($i ~ regexp) ...

Note that . matches any single character. If you want to match a literal ., you'd need regexp to contain \. (which inside double quotes would have to be written \\.) or [.].

 regexp = field_name "=[0-9]*\\.?[0-9]+"

I'd also expect you'd want to anchor the regexp:

 regexp = "^" field_name "=[0-9]*\\.?[0-9]+$"
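The effect of escaping the dot and anchoring the pattern can be checked with a small sketch like the one below. The three sample strings and the names loose, strict and report are made up for illustration.

```shell
report=$(awk -v field_name=size 'BEGIN {
  loose  = field_name "=[0-9]*.?[0-9]+"          # "." matches any character, no anchors
  strict = "^" field_name "=[0-9]*\\.?[0-9]+$"   # literal ".", anchored at both ends

  n = split("size=1.5 size=1x5 mysize=1.5", sample, " ")
  for (i = 1; i <= n; i++)
    print sample[i], (sample[i] ~ loose ? "loose:yes" : "loose:no"), (sample[i] ~ strict ? "strict:yes" : "strict:no")
}')
printf '%s\n' "$report"
```

The loose pattern accepts all three strings, while the strict one accepts only size=1.5: size=1x5 fails because x is not a literal dot, and mysize=1.5 fails because of the leading anchor.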

Other notes:

cat "${file}" is a useless use of cat (UUOC). Compared with a redirection, it also has the drawback that it doesn't work when $file starts with -, and awk still runs even if the file can't be opened.
-v field_name="$external_data" mangles backslashes, because awk expands escape sequences in -v assignments. An approach that avoids the problem is to use an environment variable: FIELD="$external_data" awk ... and refer to it within awk as ENVIRON["FIELD"].
As the contents of field_name are copied verbatim into regexp, they are treated as a regexp, so if $external_data contains regexp operators (.+*?{}()[]\^$...), it may not work properly.
In some locales and awk implementations, [0-9] matches many more characters than just 0123456789 (though I suspect those would be non-ASCII characters unlikely to occur in your input).
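Two of these notes can be illustrated with a short sketch. The first part compares -v with ENVIRON for a value containing a backslash. The second part sidesteps the regexp-operator problem by testing the field name as a literal prefix with index(), which is a swapped-in technique rather than something the answer proposes. The sample values, the TAB-separated input, and the names via_v, via_env, literal and prefix are assumptions.

```shell
external_data='a\tb'   # hypothetical value: four characters, including a backslash

# -v expands escape sequences, so \t becomes a single TAB character.
via_v=$(awk -v field_name="$external_data" 'BEGIN { print length(field_name) }')

# An environment variable reaches awk unchanged.
via_env=$(FIELD="$external_data" awk 'BEGIN { print length(ENVIRON["FIELD"]) }')

printf 'length via -v: %s, via ENVIRON: %s\n' "$via_v" "$via_env"

# Literal prefix test: the "." in the field name is not treated as a regexp operator.
literal=$(printf 'a.b=1\taxb=2\n' | FIELD='a.b' awk -F '\t' '{
  prefix = ENVIRON["FIELD"] "="
  for (i = 1; i <= NF; i++)
    if (index($i, prefix) == 1) print $i
}')
printf '%s\n' "$literal"
```

The lengths differ (3 via -v, 4 via ENVIRON), and only a.b=1 is printed, whereas a regexp built from a.b would also have matched axb=2.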

With perl:

FIELD=size <"$file" perl -lpe '
  s{
    (?<![^\t])       # not-preceded by a non-TAB
    \Q$ENV{FIELD}=\E # contents of $FIELD taken literally
    \K               # matched portion starts here
    \d*\.?\d+
    (?![^\t])        # not followed by a non-TAB
  }{
    sprintf "%.4f", $&
  }gxe'

This would not have any of the issues discussed above. It would also work better than awk with input that contains invalid text, such as a mix of text and binary data, or text encoded in a charset different from that of the user's locale.
