Perl or awk solution for this problem

awk perl

I have an input file (input.txt) like below.

id1      id2       name    weight 
53723848 12651711 timburnes 1.36667
53530214 12651711 timburnes 1.51191
53723848 53530214 timburnes 1.94
764157 52986038 ericcartman 0.861145
56797854 764157 ericcartman 1.35258
56797854 52986038 ericcartman 1.73781

Note that the first line is not part of the actual file; I have added it here for clarity.

I am trying to extract the id1 and id2 values into two separate files named unique.txt and duplicate.txt.

If the weight column is greater than 1.5, the two ids are duplicates. In this case, the id1 value goes to unique.txt and the id2 value goes to duplicate.txt.

If the weight column is less than 1.5, the ids are not duplicates. In this case, both id1 and id2 go to unique.txt.

For the above input, I expect the following output.

For unique.txt file,

53723848 timburnes
764157 ericcartman
56797854 ericcartman

For duplicate.txt file,

12651711 timburnes
53530214 timburnes
52986038 ericcartman

I can find the duplicates using the code below.

To get the rows whose 4th column is 1.5 or greater,

awk -F" " '$4 >= 1.5 { print $1" " $2" " $3" " $4}' file1.txt > Output.txt

Then, for those rows, I can use the code below to merge the duplicate ids based on their names.

  perl -ane 'foreach(@F[0..1]){$k{$F[2]}{$_}++}
           END{
                foreach $v (sort keys(%k)){
                    print "$_ " foreach(keys(%{$k{$v}})); 
                    print "$v\n"
                }; 
            } ' Output.txt

However, this approach does not give me the output in the format I want.

EDIT:

I am running the command for my input as below.

awk '{
      if ($4 > 1.5) { 
          if (++dup[$2] == 1)  print $2, $3 > "duplicate.txt"
      } 
      else
          if (++uniq[$1] == 1) print $1, $3 > "unique.txt" 
}' << END
17412193 43979400 ericcartman 2.16667
21757330 54678379 andrewruss 0.55264
END

I am getting the output as,

-bash-3.2$ cat unique.txt
21757330 andrewruss
-bash-3.2$ cat duplicate.txt
43979400 ericcartman

However, the output I am expecting is,

cat unique.txt
17412193 ericcartman
21757330 andrewruss
54678379 andrewruss
cat duplicate.txt
43979400 ericcartman

Best Answer

Here is an awk solution:

$ awk '
    $4 < 1.5 {
      uniq[$1] = $3;
      uniq[$2] = $3;
      next;
  }
  {
      uniq[$1] = $3;
      dup[$2] = $3;
      delete uniq[$2];
  }
  END {
    print "--unique.txt--";
    for(i in uniq) {
        print i,uniq[i]
    }
    print "";
    print "--duplicate.txt--";
    for(i in dup) {
        print i,dup[i]
    }
    }' file
--unique.txt--
764157 ericcartman
56797854 ericcartman
53723848 timburnes

--duplicate.txt--
53530214 timburnes
52986038 ericcartman
12651711 timburnes

With your second example:

$ awk '
    $4 < 1.5 {
      uniq[$1] = $3;
      uniq[$2] = $3;
      next;
  }
  {
      uniq[$1] = $3;
      dup[$2] = $3;
      delete uniq[$2];
  }
  END {
    print "--unique.txt--";
    for(i in uniq) {
        print i,uniq[i]
    }
    print "";
    print "--duplicate.txt--";
    for(i in dup) {
        print i,dup[i]
    }
    }' << END
> 17412193 43979400 ericcartman 2.16667
> 21757330 54678379 andrewruss 0.55264
END
--unique.txt--
21757330 andrewruss
54678379 andrewruss
17412193 ericcartman

--duplicate.txt--
43979400 ericcartman
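
Since the question asks for two files rather than output on the terminal, the same idea can redirect inside the END block. The following is only a sketch, not the answer's exact code: it assumes the data lives in input.txt (the name used in the question), it treats a weight of exactly 1.5 as a duplicate (the question does not say), and it varies the logic slightly by filtering duplicates out of the unique list at the end instead of deleting them line by line, so the result does not depend on line order. It recreates the sample data so that it can be run as-is.

```shell
# Recreate the sample input from the question (without the header line).
cat > input.txt << 'EOF'
53723848 12651711 timburnes 1.36667
53530214 12651711 timburnes 1.51191
53723848 53530214 timburnes 1.94
764157 52986038 ericcartman 0.861145
56797854 764157 ericcartman 1.35258
56797854 52986038 ericcartman 1.73781
EOF

awk '
    # Weight below the threshold: both ids are candidates for unique.txt.
    $4 < 1.5 { uniq[$1] = $3; uniq[$2] = $3; next }
    # Otherwise id1 is a candidate for unique.txt and id2 is a duplicate.
    { uniq[$1] = $3; dup[$2] = $3 }
    END {
        # Create both files empty first, so each exists even with no rows.
        printf "" > "unique.txt"
        printf "" > "duplicate.txt"
        # An id marked as a duplicate on any line is kept out of unique.txt.
        for (i in uniq) if (!(i in dup)) print i, uniq[i] > "unique.txt"
        for (i in dup) print i, dup[i] > "duplicate.txt"
    }' input.txt
```

The order of lines within each file is whatever awk's `for (i in array)` happens to produce, so pipe the files through sort if a stable order matters.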

Related Solutions

How to print top five highest numbers from a column

sort -k3n,3 filename | tail -5 | cut -d " " -f1,6-7

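The above command sorts the file numerically on the 3rd field, so the largest values end up at the bottom. Piping that output to tail -5 keeps the last five lines, which are the five highest numbers in the 3rd column. If you need only the first and 3rd columns in the output, pipe the result to cut. Because cut -d " " treats every single space as a delimiter, the field numbers given to -f depend on how many spaces separate the columns in your file.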

Testing

cat filename

T_235820.1|   139697  192 0
xm|161622288|ref|RT_340093.1|   153819  2607    0
xm|75755638|ref|RT_557407.1|    153821  1937    0
xm|108773031|ref|RT_678101.1|   161452  1688    0
xm|30352011|ref|RT_784766.1|    150568  105 0
T_235820.1|   139697  192 0
xm|161622288|ref|RT_340093.1|   153819  607    0
xm|75755638|ref|RT_557407.1|    153821  937    0
xm|108773031|ref|RT_678101.1|   161452  1881    0
xm|30352011|ref|RT_784766.1|    150568  1051 0

Now, I run the above command on this file.

sort -k3n,3 filename | tail -5 | cut -d " " -f1,6-7

The output that I get is,

xm|30352011|ref|RT_784766.1|  1051
xm|108773031|ref|RT_678101.1| 1688 
xm|108773031|ref|RT_678101.1| 1881 
xm|75755638|ref|RT_557407.1|  1937
xm|161622288|ref|RT_340093.1| 2607
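
Because the cut field numbers depend on the exact spacing, one possible alternative is to let awk pick the columns, since awk treats any run of blanks as one separator. This is a sketch under assumptions: sample.txt and top5.txt are illustrative file names, and the rows are the complete lines copied from the test file above.

```shell
# Rows copied from the test file above; the column spacing is uneven.
cat > sample.txt << 'EOF'
xm|161622288|ref|RT_340093.1|   153819  2607    0
xm|75755638|ref|RT_557407.1|    153821  1937    0
xm|108773031|ref|RT_678101.1|   161452  1688    0
xm|30352011|ref|RT_784766.1|    150568  105 0
xm|161622288|ref|RT_340093.1|   153819  607    0
xm|75755638|ref|RT_557407.1|    153821  937    0
xm|108773031|ref|RT_678101.1|   161452  1881    0
xm|30352011|ref|RT_784766.1|    150568  1051 0
EOF

# Sort numerically on column 3, keep the five largest rows,
# then print columns 1 and 3 regardless of how many blanks separate them.
sort -k3,3n sample.txt | tail -5 | awk '{ print $1, $3 }' > top5.txt
cat top5.txt
```

With this sample, the values in the second output column run from 1051 up to 2607, the same five numbers as in the output above.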

EDIT

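The -n flag already handles negative numbers and decimal fractions. If your file also contains values in scientific notation (such as 1.5e3), use the -g flag (general numeric sort) in place of -n; GNU sort does not accept the two together. The command would look like,

sort -k3g,3 filename | tail -5 | cut -d " " -f1,6-7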

How to compare the strings using < (less-than symbol)

A simple solution is to use awk. Since awk splits its input into fields on whitespace (by default), the PS sensor value will be field 7 ($7), the min will be $11 and the max will be $15. You can, therefore, do:

awk  '$7>$11 && $7<$15' file > new.file

The default action when an expression evaluates to true in awk is to print the current line. Therefore, the command above will print all lines whose 7th field is between the min and max values.
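
To see the range test in action, here is a small sketch. The log format below is hypothetical (the original input is not shown here); it was only made up so that the value lands in field 7, the min in field 11 and the max in field 15. The file names sensors.txt and in_range.txt are likewise illustrative. awk compares fields numerically when they look like numbers; adding 0 makes that intent explicit and guards against an accidental string comparison, where "9" would sort after "10".

```shell
# Hypothetical sensor log: value in field 7, min in field 11, max in field 15.
cat > sensors.txt << 'EOF'
node1 sensor PS reading value = 42 min value = 10 max value = 90
node2 sensor PS reading value = 9 min value = 10 max value = 90
node3 sensor PS reading value = 100 min value = 10 max value = 90
EOF

# Keep only the lines whose value lies strictly between min and max.
# The "+ 0" forces each field to be compared as a number.
awk '($7 + 0) > ($11 + 0) && ($7 + 0) < ($15 + 0)' sensors.txt > in_range.txt
cat in_range.txt
```

Only the node1 line is printed: 9 is below the min and 100 is above the max.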
