Lum – Using awk to identify the number identical columns

awkcolumns

I have a large number of individual files that contain six columns each (number of rows can vary). As a simple example:

1   0   0   0   0   0

0   1   1   1   0   0

I am trying to identify how many unique columns I have (i.e. numbers and their order match), in this case it would be 3.

Is there a simple one-liner to do this? I know it is easy to compare one column with another column, but how to find identical columns?

Best Answer

You can count the unique columns with following pipe:

$ awk '{for (i=1; i<=NF; ++i) a[i]=a[i]$i; } END { for (i in a) print a[i] }' foo \
  | sort -u | wc -l

The awk command transposes your input, the resulting lines are sorted, only unique lines are kept (-u) and at the end all (unique) lines (i.e. the transposed columns) are counted (wc -l).

Note that NF is a builtin awk variable and is automatically set to the number of fields in the current record. $i references the i-th field and END guards the following block such that it is executed after all records are processed. Awk uses by default blank-non-blank field delimiting.

Awk

this awk script will work on an arbitrary number of columns > 2 and order of appearance will be preserved as across then down with no assumptions made about what the columns are (i.e. doesn't matter if they are numeric or not, sorted or not, etc):

{
    for (i = 2; i <= NF; i++) {
        a[j + i] = $1 " " $i
    }
    j += (i - 1);
}
END {
    OutNR = NR * NF;
    for (i = 2; i <= NF; i++) {
        for (j = 0; j < OutNR; j += NF) { 
            print a[j + i];
        }
    }
}

Given:

0 0 0 0.2340
0.05 9.6877884e-06 0.0024898597 0.2341
0.1 4.2838688e-05 0.0049595502 0.2342
0.15 0.00016929444 0.0074092494 0.2343
0.2 0.00036426881 0.009839138 0.2344
0.25 0.00055234582 0.012249394 0.2345
0.3 0.00077448576 0.014640196 0.2346
0.35 0.00082546537 0.017011717 0.2347
0.4 0.0012371619 0.019364133 0.2348
0.45 0.0013286382 0.02169761 0.2349

Order by column (2..n) then by line:

0 0
0.05 9.6877884e-06
0.1 4.2838688e-05
0.15 0.00016929444
0.2 0.00036426881
0.25 0.00055234582
0.3 0.00077448576
0.35 0.00082546537
0.4 0.0012371619
0.45 0.0013286382
0 0
0.05 0.0024898597
0.1 0.0049595502
0.15 0.0074092494
0.2 0.009839138
0.25 0.012249394
0.3 0.014640196
0.35 0.017011717
0.4 0.019364133
0.45 0.02169761
0 0.2340
0.05 0.2341
0.1 0.2342
0.15 0.2343
0.2 0.2344
0.25 0.2345
0.3 0.2346
0.35 0.2347
0.4 0.2348
0.45 0.2349

R

Although most people don't think of R for text processing, in this case, it's actually a bit more straight-forward, although all of the option setting makes it appear to be more complex than it really is. The essence of this solution is to simply rbind() multiple cbind():

d.in <- read.table(file = commandArgs(trailingOnly = T)[1]
                    , colClasses = "character");
d.out<-data.frame();
for (i in 2:length(d.in)) {
    d.out <- rbind(d.out, cbind(d.in[,1], d.in[,i]));
}
write.table(d.out, row.names = F, col.names = F, quote = F);

Then, just:

$ Rscript script.R data.txt
0 0
0.05 9.6877884e-06
0.1 4.2838688e-05
0.15 0.00016929444
0.2 0.00036426881
0.25 0.00055234582
0.3 0.00077448576
0.35 0.00082546537
0.4 0.0012371619
0.45 0.0013286382
0 0
0.05 0.0024898597
0.1 0.0049595502
0.15 0.0074092494
0.2 0.009839138
0.25 0.012249394
0.3 0.014640196
0.35 0.017011717
0.4 0.019364133
0.45 0.02169761
0 0.2340
0.05 0.2341
0.1 0.2342
0.15 0.2343
0.2 0.2344
0.25 0.2345
0.3 0.2346
0.35 0.2347
0.4 0.2348
0.45 0.2349

Best Answer

Related Solutions

Lum – How to reorder a large number of columns

Lum – Append columns in a text file to after the final row

Awk

R

Related Question