How Does awk '!a[$0]++' Work?

This one-liner removes duplicate lines from text input without pre-sorting.

For example:

$ cat >f
q
w
e
w
r
$ awk '!a[$0]++' <f
q
w
e
r
$

The original code I found on the internet read:

awk '!_[$0]++'

This was even more perplexing to me as I took _ to have a special meaning in awk, like in Perl, but it turned out to be just a name of an array.
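
A quick check, as a sketch: the array name can be swapped for any other valid identifier without changing the result. The name seen below is a hypothetical stand-in chosen for illustration, and the input is the same sample as in the f file above.

```shell
# Feed the sample input to both spellings of the one-liner.
underscore_out=$(printf 'q\nw\ne\nw\nr\n' | awk '!_[$0]++')
named_out=$(printf 'q\nw\ne\nw\nr\n' | awk '!seen[$0]++')   # "seen" is an arbitrary name

# Show one result, then confirm that both variables hold the same lines.
printf '%s\n' "$underscore_out"
[ "$underscore_out" = "$named_out" ] && echo "identical output"
```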

Now, I understand the logic behind the one-liner: each input line is used as a key in a hash array, so a line is printed only the first time it arrives and the output holds the unique lines in order of arrival.

What I would like to learn is how exactly this notation is interpreted by awk. For example, what do the bang sign (!) and the other elements of this code snippet mean?

How does it work?

Best Answer

Here is an "intuitive" answer; for a more in-depth explanation of awk's mechanism, see either @Cuonglm's

In this case, !a[$0]++, the post-increment ++ can be set aside for a moment: it changes a[$0] only after the old value has been read, so it does not change the value of the expression. So, look at only !a[$0]. Here:

a[$0]

uses the current line $0 as the key into the array a, taking the value stored there. If this particular key was never referenced before, a[$0] evaluates to the uninitialized value, which behaves as the empty string (and as zero in a numeric context).
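
As a small sketch of that behaviour, the following asks awk how it sees an element that was never assigned. The function name unset_element_demo and the key "x" are made up for illustration.

```shell
# Hypothetical helper: inspect an array element that was never assigned.
unset_element_demo() {
    awk 'BEGIN {
        # a["x"] is uninitialized, so it compares equal to both "" and 0.
        if (a["x"] == "") print "equal to empty string"
        if (a["x"] == 0)  print "equal to zero"
        # Referencing a["x"] above was enough to create the key.
        if ("x" in a)     print "key now exists"
    }'
}

unset_element_demo
```

All three messages should be printed, which is why the first lookup of any line counts as false.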

!a[$0]

The ! negates that value. If it was empty or zero (false), the result is now true. If it was non-zero (true), the result is false. If the whole expression evaluates to true, meaning that a[$0] was not set to begin with, the whole line is printed, because printing the line is the default action when a pattern has no action block.
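
To see both pieces in isolation, here is a small sketch. The first function prints the result of ! applied to a few values; the second uses a made-up pattern, !($0 == "w"), with no action block, so awk prints exactly the lines for which the pattern is true. The function names, the variable x, and the sample input are assumptions for illustration.

```shell
# 1 means true, 0 means false; x is never assigned, so it is uninitialized.
negation_demo() {
    awk 'BEGIN { print !x, !0, !"", !1, !5 }'
}
negation_demo    # prints: 1 1 1 0 0

# Pattern without an action: a line is printed only when the pattern is true.
pattern_only_demo() {
    awk '!($0 == "w")'
}
printf 'q\nw\ne\n' | pattern_only_demo    # prints q and e, drops w
```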

Also, regardless of the old value, the post-increment operator adds one to a[$0], so the next time the same key is looked up, the stored value will be positive and the whole condition will be false.
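
Putting the steps together, the one-liner can be spelled out in long form. This is only a sketch of an equivalent program, not how awk evaluates it internally; the names dedupe_verbose, seen, and count are hypothetical.

```shell
# Hypothetical long-form equivalent of: awk '!a[$0]++'
dedupe_verbose() {
    awk '{
        count = seen[$0]      # old value: empty/0 the first time a line appears
        seen[$0] = count + 1  # the effect of the post-increment ++
        if (!count)           # true only on the first occurrence
            print $0          # what the default action would do
    }' "$@"
}

printf 'q\nw\ne\nw\nr\n' | dedupe_verbose
```

On the sample input this should print q, w, e, r, the same as the short form.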
