Lum – Using awk to identify the number of identical columns

awk columns

I have a large number of individual files that contain six columns each (number of rows can vary). As a simple example:

1   0   0   0   0   0

0   1   1   1   0   0

I am trying to find out how many unique columns I have (i.e. two columns are identical when their numbers and the order of those numbers match). In this case the answer would be 3.

Is there a simple one-liner to do this? I know it is easy to compare one column with another column, but how do I find identical columns?

Best Answer

You can count the unique columns with the following pipeline:

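$ awk '{ for (i=1; i<=NF; ++i) a[i]=a[i] FS $i } END { for (i in a) print a[i] }' foo \
  | sort -u | wc -l

The awk command transposes your input, so each column becomes one line. The fields are joined with FS so that multi-digit values cannot run together (without a separator, the columns "1, 11" and "11, 1" would both become "111"). sort -u then keeps only the unique lines, and wc -l counts them, which gives the number of unique columns.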

Note that NF is a built-in awk variable that is automatically set to the number of fields in the current record. $i refers to the i-th field, and END marks the block that is executed after all records have been processed. By default, awk splits fields on runs of blanks (spaces and tabs).
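Since the question mentions many files, the same idea can also be wrapped in a small shell function that does the counting inside awk, without sort and wc. This is only a sketch: the function name count_unique_columns and the array names col and seen are made up for illustration, and it assumes every non-empty row has the same number of fields.

```shell
# Hypothetical helper: print the number of distinct columns in the file "$1".
# Assumes all non-empty rows have the same number of fields.
count_unique_columns() {
    awk '
        # Build one string per column. SUBSEP between the fields keeps
        # the column "1, 11" distinct from the column "11, 1".
        { for (i = 1; i <= NF; ++i) col[i] = col[i] SUBSEP $i }

        END {
            # Count each distinct column string once.
            for (i in col)
                if (!(col[i] in seen)) { seen[col[i]] = 1; ++n }
            print n + 0   # "+ 0" prints 0 instead of an empty line for empty input
        }
    ' "$1"
}

# Possible use over many files (the *.txt pattern is an assumption):
#   for f in *.txt; do printf '%s\t%s\n' "$f" "$(count_unique_columns "$f")"; done
```

Empty lines have NF equal to 0, so they are skipped by the loop and do not affect the result.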
