I have an input file (input.txt) like the one below.
id1 id2 name weight
53723848 12651711 timburnes 1.36667
53530214 12651711 timburnes 1.51191
53723848 53530214 timburnes 1.94
764157 52986038 ericcartman 0.861145
56797854 764157 ericcartman 1.35258
56797854 52986038 ericcartman 1.73781
Note that the first line is not part of the actual file; I have added it here for clarity.
I am trying to extract the id1 and id2 values into two separate files named unique.txt and duplicate.txt.

If the weight column value is greater than 1.5, it means I have duplicate ids. In this case, I will move the id1 value to the unique.txt file and the id2 value to the duplicate.txt file.

If the weight column value is less than 1.5, it means I do not have duplicate values. So, in this case, I will move both id1 and id2 to the unique.txt file.
So, for the above input, I am expecting the following output. Note that an id such as 12651711 appears both in a row with weight below 1.5 and in a row with weight above it; whenever an id is flagged as a duplicate in any row, it should end up in duplicate.txt only.

For the unique.txt file:
53723848 timburnes
764157 ericcartman
56797854 ericcartman
For the duplicate.txt file:
12651711 timburnes
53530214 timburnes
52986038 ericcartman
I can find the duplicates using the code below. To get the rows whose 4th column is greater than 1.5:
awk '$4 > 1.5 { print $1, $2, $3, $4 }' input.txt > Output.txt
Now, for the values greater than 1.5, I can use the code below to merge the duplicate ids based on their names.
perl -ane 'foreach (@F[0..1]) { $k{$F[2]}{$_}++ }
END {
    foreach $v (sort keys %k) {
        print "$_ " foreach keys %{ $k{$v} };
        print "$v\n";
    }
}' Output.txt
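For the Output.txt above, this prints one line per name with all of that name's ids grouped in front of it; the order of the ids within a line depends on Perl's hash ordering, so the result looks something like:
56797854 52986038 ericcartman
53530214 12651711 53723848 timburnes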
However, this approach does not give me the output in the format I want.
EDIT:
I am running the following command on my input:
awk '{
    if ($4 > 1.5) {
        if (++dup[$2] == 1) print $2, $3 > "duplicate.txt"
    }
    else
        if (++uniq[$1] == 1) print $1, $3 > "unique.txt"
}' << END
17412193 43979400 ericcartman 2.16667
21757330 54678379 andrewruss 0.55264
END
I am getting this output:
-bash-3.2$ cat unique.txt
21757330 andrewruss
-bash-3.2$ cat duplicate.txt
43979400 ericcartman
However, the output I am expecting is:
cat unique.txt
17412193 ericcartman
21757330 andrewruss
54678379 andrewruss
cat duplicate.txt
43979400 ericcartman
Best Answer

Here is an awk solution. Your EDIT script never writes $1 anywhere when the weight exceeds 1.5, and never writes $2 anywhere when it does not, which is why both files come up short. There is also a subtler problem, visible in your first example: an id such as 12651711 can sit in rows on both sides of the threshold, yet it must land in duplicate.txt only, so a single row-by-row pass cannot know where an id belongs. Reading the input twice fixes both issues: the first pass collects every id2 from a row with weight greater than 1.5, and the second pass routes each id, sending collected ids to duplicate.txt and everything else to unique.txt.
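A minimal two-pass sketch of that idea (assuming the data is saved in input.txt; the file is read twice, so the here-document from your EDIT will not work here):
awk '
    # Pass 1 (FNR == NR while reading the first copy of the file):
    # remember every id2 from a row whose weight is greater than 1.5
    FNR == NR { if ($4 > 1.5) dup[$2] = 1; next }

    # Pass 2: route both ids of each row, printing each id only once
    {
        for (i = 1; i <= 2; i++) {
            id = $i
            if (id in dup) {
                if (!seen_dup[id]++) print id, $3 > "duplicate.txt"
            } else if (!seen_uniq[id]++) {
                print id, $3 > "unique.txt"
            }
        }
    }
' input.txt input.txt
With your second example saved as input.txt:
cat unique.txt
17412193 ericcartman
21757330 andrewruss
54678379 andrewruss
cat duplicate.txt
43979400 ericcartman
With the six-row input from the top of the question, it likewise produces exactly the unique.txt and duplicate.txt listings shown there.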