Linux – Select Columns Where Value Appears More Than X Times Using awk

awklinuxtext processing

I have a file with several rows and columns. I want to select columns where the number 2 appears more than x times.

My tab separated file looks like this:

Individuals  M1 M2 M3
Ind1          0 0  2
Ind2          0 2  2
Ind3          2 2  2

In this cartoon example, let's say that I want the columns where the number 2 appears two or more times. My output would be:

Individuals   M2 M3
Ind1          0  2
Ind2          2  2
Ind3          2  2

With R this is quite easy, but takes forever because the file is too big, so I would like to do it with awk or something similar. Could you please tell me how to achieve this?

Best Answer

BEGIN { OFS = FS = "\t" }

FNR == NR {
        for (i = 2; i <= NF; ++i)
                if ($i == 2) ++c[i]
        next
}

{
        a[nf=1] = $1
        for (i = 2; i <= NF; ++i)
                if (c[i] >= t) a[++nf] = $i

        $0 = ""
        for (i = 1; i <= nf; ++i)
                $i = a[i]

        print
}

This awk program would count the number of occurrences of the value 2 in each column and store these counts in the array c (one lement in this array per column of data). It does this while reading the input file the first time (this is the FNR == NR block).

When reading the input file a second time it uses these counts to transfer the appropriate columns from the input to the array a for each line read. The value of the variable t is used as the threshold value to decide whether the column should be included or not. This is the first for loop in the last block in the code.

It then creates a new data record from this array and prints it.

Testing it (note that the input file is given twice on the command line for awk to be able to do two passes over it):

$ cat file
Individuals     M1      M2      M3
Ind1    0       0       2
Ind2    0       2       2
Ind3    2       2       2
$ awk -v t=1 -f script.awk file file
Individuals     M1      M2      M3
Ind1    0       0       2
Ind2    0       2       2
Ind3    2       2       2
$ awk -v t=2 -f script.awk file file
Individuals     M2      M3
Ind1    0       2
Ind2    2       2
Ind3    2       2
$ awk -v t=3 -f script.awk file file
Individuals     M3
Ind1    2
Ind2    2
Ind3    2
$ awk -v t=4 -f script.awk file file
Individuals
Ind1
Ind2
Ind3