Text Processing – Column Deletion Based on Number of String Matches

awk, text processing

I need a command that removes every column of a text file that contains ${MaxAllowedNumberOfFs} or more 'F's (the number of rows varies from file to file).

I have some pseudo code that is close, but I don't know how to put a limit on the number of matches.

Say the limit is set to 3 and the example input file is:

F G F H H
G F F F A
F G F F F
F F F T F

Then the desired output would be:

G H H
F F A
G F F
F T F

Pseudo code that's close (the limit can and will change from file to file):

MaxAllowedNumberOfFs="1012"

Count_of_columns=`awk '{print NF}' filename | sort -nr | sed -n '$p'` 

for((i=1;i<=$Count_of_columns;i++)); do awk -v i="$i" -v x="$MaxAllowedNumberOfFs" '$i == F =>x number of times {$i="";print $0}' filename; done

Obviously I could loop through all the columns, count the number of occurrences within each column using grep, and then remove the columns that don't meet the criterion, but that would be really slow. I'd really like a neat awk command for this, but I don't have the awk skills.

Best Answer

One approach is to read the file twice. The first time one counts the F's in each column, and the second time one outputs each line without the columns that reached the limit. So something like:

#!/bin/sh

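awk -v n=3 '
        # First pass: count the F entries in each column.
        NR==FNR { for (i=1;i<=NF;i++) { if ($i == "F") { c[i]++ }} ;next }
        # Second pass: print only the columns with fewer than n F entries.
        { for (i=1;i<=NF;i++) { if (c[i] < n) { printf("%s ", $i) } } ;printf("\n") }
' filename filename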

The NR==FNR test is a trick for telling whether this is the first or second time we are reading the file. NR counts records across all input files, while FNR restarts at 1 for each file, so the two are equal only during the first read (assuming the file is not empty). The array c holds the number of F entries counted in each column. The next statement ends processing of the current line during the first read, so the second rule runs only during the second read, where it prints the columns whose count is below n.
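
As a minimal sketch, the two-pass approach above could be wrapped in a shell function so that the limit and the file name are passed as arguments rather than hard-coded. The function name drop_f_columns and the temporary sample file are assumptions made for illustration. This variant also builds each output line with a separator, so it does not leave the trailing space that printf("%s ", $i) produces.

```shell
#!/bin/sh
# Hypothetical helper (name chosen for illustration):
#   $1 = limit; columns with this many F entries or more are dropped
#   $2 = input file, which is read twice
drop_f_columns() {
    max_fs=$1
    file=$2
    awk -v n="$max_fs" '
        # First pass: count the F entries in each column.
        NR == FNR { for (i = 1; i <= NF; i++) if ($i == "F") c[i]++; next }
        # Second pass: print only the columns with fewer than n F entries.
        {
            sep = ""
            for (i = 1; i <= NF; i++)
                if (c[i] < n) { printf "%s%s", sep, $i; sep = " " }
            printf "\n"
        }
    ' "$file" "$file"
}

# Sample data from the question, written to a temporary file.
sample=$(mktemp)
printf '%s\n' 'F G F H H' 'G F F F A' 'F G F F F' 'F F F T F' > "$sample"
result=$(drop_f_columns 3 "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

With the sample input from the question and a limit of 3, this prints the four lines of the desired output. Because the file is named twice on the awk command line, the input has to be a regular file; data arriving on a pipe would need to be saved to a file first.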
