Lum – Subsetting a file’s contents based on value in a column

columnstext processing

I have a text file as (for example):

apple   V$NFY_Q6_rc V=0.741
apple   V$HOXA7_01_rc   V=0.454
apple   V$ALPHACP1_01_rc    V=0.695
orange  V$SP4_Q5    V=0.747
grapes  V$SP1_Q2_01 V=0.677
grapes  V$SP1_Q6_01_rc  V=0.685
grapes  V$SP1_Q6_rc V=0.884

I want to extract the rows that have the highest V value (in column 3) for each fruit. My output should look like:

apple   V$NFY_Q6_rc V=0.741
orange  V$SP4_Q5    V=0.747
grapes  V$SP1_Q6_rc V=0.884

I have several files like this in different subdirectories.

After subsetting, I would use the following code to get the number of occurences of a particular string in column 2.

        perl -lanE '$str=$F[1];  $f="/home/$str/list/$str.txt"; $c=`grep -c "$str" "$f"`;chomp($c);$x=0;$x++ if $c;say "$str\t$x\t$c"' file2

this would result in the following output. Let's assume I am searching for the string "SP4" in column 2 of this file:

         X       X in file?  number of times it occurs
        NFA            0                            0
        SP4            1                            2
        NFATC1         0                            0

I want the V value (from above to be included along with this table)

          X       X in file?  number of times it occurs      V value
        NFA            0                            0
        SP4            1                            2         0.747
        NFATC1         0                            0

P.S.: For further details on the perl program in between, please refer to this link:

         http://stackoverflow.com/questions/23109490/search-for-occurrence-of-a-string-in-another-file-in-a-particular-column

Best Answer

If order is not important, two passes of sort should do the trick. In the first pass, sort by field 1 followed by reverse-sorting by the numeric portion of field 3 (starting position 4). The -b modifier causes leading blanks to be ignored. Pipe this to a second sort that returns one record for every unique value in field 1, but this time specify the stable sort(-s) modifier to ensure that the record with the highest value in field 3 (which had bubbled up to the top for each value in field 1 courtesy the previous sort) is returned

sort -k1,1 -k3.4b,3nr file.txt | sort -k1,1 -s -u
apple   V$NFY_Q6_rc V=0.741
grapes  V$SP1_Q6_rc V=0.884
orange  V$SP4_Q5    V=0.747
Related Question