This one-liner removes duplicate lines from text input without pre-sorting.
For example:
$ cat >f
q
w
e
w
r
$ awk '!a[$0]++' <f
q
w
e
r
$
The original code I have found on the internets read:
awk '!_[$0]++'
This was even more perplexing to me as I took _ to have a special meaning in awk, like in Perl, but it turned out to be just a name of an array.
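A quick way to check that the array name is arbitrary is to run both spellings on the same input. This is only a sketch: the name seen and the shell variables are made up for illustration, and the input is the sample from above.

```shell
# Sample input from the question (five lines, "w" appears twice).
input=$(printf 'q\nw\ne\nw\nr\n')

# Same one-liner with two different array names; "seen" is an arbitrary choice.
with_underscore=$(printf '%s\n' "$input" | awk '!_[$0]++')
with_seen=$(printf '%s\n' "$input" | awk '!seen[$0]++')

# Report whether the two spellings produced identical output.
if [ "$with_underscore" = "$with_seen" ]; then
  echo "same output"
fi
```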
Now, I understand the logic behind the one-liner:
each input line is used as a key in a hash array, thus, upon completion, the hash contains unique lines in the order of arrival.
What I would like to learn is how exactly this notation is interpreted by awk. E.g. what the bang sign (!) means and the other elements of this code snippet.
How does it work?
Best Answer
Here is an "intuitive" answer; for a more in-depth explanation of awk's mechanism, see @Cuonglm's answer.
In this case, !a[$0]++, the post-increment ++ can be set aside for a moment, it does not change the value of the expression. So, look at only !a[$0]. Here:

- a[$0] uses the current line $0 as key to the array a, taking the value stored there. If this particular key was never referenced before, a[$0] evaluates to the empty string.
- The ! negates the value from before. If it was empty or zero (false), we now have a true result. If it was non-zero (true), we have a false result. If the whole expression evaluated to true, meaning that a[$0] was not set to begin with, the whole line is printed as the default action.
- Also, regardless of the old value, the post-increment operator adds one to a[$0], so the next time the same value in the array is accessed, it will be positive and the whole condition will fail.
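To make those steps concrete, the one-liner can be spelled out as an explicit action block. This is a rough equivalent written for illustration, not how awk itself rewrites the expression; the function name dedup and the variable seen_before are hypothetical.

```shell
# Hypothetical helper: a long-hand version of awk '!a[$0]++'.
dedup() {
  awk '{
    seen_before = a[$0]   # old value: empty on first sight, a count afterwards
    a[$0]++               # the post-increment happens regardless of the old value
    if (!seen_before)     # empty or zero is false, so this is true only the first time
      print               # same as the default action: print the whole line ($0)
  }' "$@"
}

# Usage with the sample input from the question.
printf 'q\nw\ne\nw\nr\n' | dedup
```

The second w finds a[$0] already set to 1, so the negation is false and the line is skipped.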