Perl or awk solution for this problem

awk perl

I have an input file (input.txt) like the one below.

id1      id2       name    weight 
53723848 12651711 timburnes 1.36667
53530214 12651711 timburnes 1.51191
53723848 53530214 timburnes 1.94
764157 52986038 ericcartman 0.861145
56797854 764157 ericcartman 1.35258
56797854 52986038 ericcartman 1.73781

Note that the first line is not part of the actual file; I have added it here for clarity.

I am trying to extract the values of id1 and id2 into two separate files named unique.txt and duplicate.txt.

If the weight column value is greater than 1.5, it means I have duplicate ids. In this case, I will move the id1 value to unique.txt and the id2 value to duplicate.txt.

If the weight column value is less than 1.5, it means I do not have duplicate ids. In this case, I will move both id1 and id2 to unique.txt.

So for the above input, I am expecting the following output.

For unique.txt file,

53723848 timburnes
764157 ericcartman
56797854 ericcartman

For duplicate.txt file,

12651711 timburnes
53530214 timburnes
52986038 ericcartman

I can find out the duplicates using the below code.

To get the rows whose 4th column is 1.5 or greater,

awk '$4 >= 1.5 { print $1, $2, $3, $4 }' file1.txt > Output.txt

Now, for those rows, I can use the code below to merge the duplicate ids based on their names.

perl -ane 'foreach (@F[0..1]) { $k{$F[2]}{$_}++ }
           END {
               foreach $v (sort keys %k) {
                   print "$_ " foreach keys %{$k{$v}};
                   print "$v\n";
               }
           }' Output.txt

However, this approach does not give me the output in the format I want.

EDIT:

I am running the command for my input as below.

awk '{
      if ($4 > 1.5) { 
          if (++dup[$2] == 1)  print $2, $3 > "duplicate.txt"
      } 
      else
          if (++uniq[$1] == 1) print $1, $3 > "unique.txt" 
}' << END
17412193 43979400 ericcartman 2.16667
21757330 54678379 andrewruss 0.55264
END

I am getting the output as,

-bash-3.2$ cat unique.txt
21757330 andrewruss
-bash-3.2$ cat duplicate.txt
43979400 ericcartman

However, the output I am expecting is,

cat unique.txt
17412193 ericcartman
21757330 andrewruss
54678379 andrewruss
cat duplicate.txt
43979400 ericcartman

Best Answer

Here is an awk solution:

$ awk '
    # Weight below 1.5: neither id is a duplicate.
    $4 < 1.5 {
        uniq[$1] = $3
        uniq[$2] = $3
        next
    }
    # Otherwise id1 is unique and id2 is its duplicate.
    {
        uniq[$1] = $3
        dup[$2] = $3
        delete uniq[$2]
    }
    END {
        print "--unique.txt--"
        for (i in uniq) {
            print i, uniq[i]
        }
        print ""
        print "--duplicate.txt--"
        for (i in dup) {
            print i, dup[i]
        }
    }' file
--unique.txt--
764157 ericcartman
56797854 ericcartman
53723848 timburnes

--duplicate.txt--
53530214 timburnes
52986038 ericcartman
12651711 timburnes

With your second example:

$ awk '
    # Weight below 1.5: neither id is a duplicate.
    $4 < 1.5 {
        uniq[$1] = $3
        uniq[$2] = $3
        next
    }
    # Otherwise id1 is unique and id2 is its duplicate.
    {
        uniq[$1] = $3
        dup[$2] = $3
        delete uniq[$2]
    }
    END {
        print "--unique.txt--"
        for (i in uniq) {
            print i, uniq[i]
        }
        print ""
        print "--duplicate.txt--"
        for (i in dup) {
            print i, dup[i]
        }
    }' << END
17412193 43979400 ericcartman 2.16667
21757330 54678379 andrewruss 0.55264
END
--unique.txt--
21757330 andrewruss
54678379 andrewruss
17412193 ericcartman

--duplicate.txt--
43979400 ericcartman
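
The commands above print both lists to the terminal under header lines. If you want the two files the question asks for, a minimal sketch of the same logic with the output redirected inside awk could look like the following. It assumes the data is saved as input.txt without the header row; the heredoc only recreates the sample from the question so the snippet can be run as-is.

```shell
# Recreate the sample input from the question (no header row).
cat > input.txt <<'EOF'
53723848 12651711 timburnes 1.36667
53530214 12651711 timburnes 1.51191
53723848 53530214 timburnes 1.94
764157 52986038 ericcartman 0.861145
56797854 764157 ericcartman 1.35258
56797854 52986038 ericcartman 1.73781
EOF

awk '
    # Same classification rules as the answer above.
    $4 < 1.5 { uniq[$1] = $3; uniq[$2] = $3; next }
             { uniq[$1] = $3; dup[$2] = $3; delete uniq[$2] }
    END {
        # Write each array to its own file instead of stdout.
        for (i in uniq) print i, uniq[i] > "unique.txt"
        for (i in dup)  print i, dup[i]  > "duplicate.txt"
    }' input.txt
```

The order in which awk walks an array with for (i in array) is unspecified, so the lines in each file may not follow the input order. Run sort on the files afterwards if you need a stable order. Also note that a file is only created if its array has at least one entry.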