I have a file with several rows and columns. I want to select columns where the number 2 appears more than x times.
My tab separated file looks like this:
Individuals M1 M2 M3
Ind1 0 0 2
Ind2 0 2 2
Ind3 2 2 2
In this cartoon example, let's say that I want the columns where the number 2 appears two or more times. My output would be:
Individuals M2 M3
Ind1 0 2
Ind2 2 2
Ind3 2 2
With R this is quite easy, but takes forever because the file is too big, so I would like to do it with awk or something similar. Could you please tell me how to achieve this?
Best Answer
This
awk
program would count the number of occurrences of the value2
in each column and store these counts in the arrayc
(one lement in this array per column of data). It does this while reading the input file the first time (this is theFNR == NR
block).When reading the input file a second time it uses these counts to transfer the appropriate columns from the input to the array
a
for each line read. The value of the variablet
is used as the threshold value to decide whether the column should be included or not. This is the firstfor
loop in the last block in the code.It then creates a new data record from this array and prints it.
Testing it (note that the input file is given twice on the command line for
awk
to be able to do two passes over it):