Bash – Comparing Files based on 5 fields using Awk and Bash

awk, bash

I want to compare File1 and File2 (both space-separated) using five fields (columns 1, 2, 4, 5 and 6).

*Logic:* If columns 1 and 2 of a line in File1 exactly match columns 1 and 2 of a line in File2, and the characters in columns 4 and 5 of the File2 line match any of the characters in columns 4 and 5 of the File1 line, then the two lines are concatenated and written to the output.

File1:

s2/80   20      .       A       T       86      N=2     F=5;U=4
s2/20   10      .       G       T       90      N=2     F=5;U=4
s2/90   60      .       C       G       30      N=2     F=5;U=4

File2:

s2/90   60      .       G       G       97      N=2     F=5;U=4
s2/80   20      .       A       A       20      N=2     F=5;U=4
s2/15   11      .       A       A       22      N=2     F=5;U=4
s2/90   21      .       C       C       82      N=2     F=5;U=4
s2/20   10      .       G       G       99      N=2     F=5;U=4
s2/80   10      .       T       G       11      N=2     F=5;U=4
s2/90   60      .       G       T       55      N=2     F=5;U=4

Output:

s2/80  20  .  A  T  86  N=2  F=5;U=4  s2/80  20  .  A  A  20  N=2  F=5;U=4
s2/20  10  .  G  T  90  N=2  F=5;U=4  s2/20  10  .  G  G  99  N=2  F=5;U=4
s2/90  60  .  C  G  30  N=2  F=5;U=4  s2/90  60  .  G  G  97  N=2  F=5;U=4

I'm new to this field and would appreciate any guidance.

Best Answer

Here's a solution using just awk. Put the code below in a file called ex.awk:

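# First file (file1): index every line by columns 1 and 2.
FNR==NR{
    k=$1" "$2
    a[k]=$4" "$5   # marks the key as seen in file1
    b[k]=$0        # the full file1 line
    c[k]=$4
    d[k]=$5
    next
}

# Second file (file2): print the file1 line followed by the file2 line on a match.
{ k=$1" "$2
  lc=c[k]
  ld=d[k]
  # && binds tighter than ||, so the two alternatives are grouped explicitly
  if ((k in a) && ($4==$5) && ((lc==$4) || (ld==$5))) print b[k]" "$0
}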
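When mixing `&&` and `||` in a condition like the one above, keep in mind that awk's `&&` binds more tightly than `||`, so `A && B && C || D` is read as `(A && B && C) || D`. Here is a minimal sketch of the difference, using made-up values (`A` to `D` are placeholders, not fields from the files):

```shell
# Hypothetical values: A is false, D is true.
ungrouped=$(awk 'BEGIN { A=0; B=1; C=1; D=1
                         if (A && B && C || D) print "yes"; else print "no" }')
grouped=$(awk 'BEGIN { A=0; B=1; C=1; D=1
                       if (A && B && (C || D)) print "yes"; else print "no" }')

echo "ungrouped: $ungrouped"   # D alone is enough to make this true
echo "grouped:   $grouped"     # A must hold as well, so this is false
```

Without the extra parentheses the last test can succeed on its own, regardless of the tests before it.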

And then run it like this with the above 2 files:

$ awk -f ex.awk file1 file2

Example

The sed command is only there to tidy the spacing of the output for StackExchange!

$ awk -f ex.awk file1 file2 | sed 's/[ ]\+/  /g'
s2/90  60  .  C  G  30  N=2  F=5;U=4  s2/90  60  .  G  G  97  N=2  F=5;U=4
s2/80  20  .  A  T  86  N=2  F=5;U=4  s2/80  20  .  A  A  20  N=2  F=5;U=4
s2/20  10  .  G  T  90  N=2  F=5;U=4  s2/20  10  .  G  G  99  N=2  F=5;U=4
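Note that `\+` is not part of POSIX basic regular expressions (GNU sed accepts it). If your sed doesn't, something like the following sketch should give the same two-space layout, assuming the fields are separated by spaces or tabs. It relies on awk rebuilding the record with OFS whenever a field is assigned:

```shell
# Sample line with irregular spacing (taken from file1 above).
line='s2/80   20      .       A       T       86      N=2     F=5;U=4'

# Assigning $1 to itself forces awk to rebuild $0 using OFS (two spaces here).
tidy=$(printf '%s\n' "$line" | awk -v OFS='  ' '{ $1 = $1; print }')

echo "$tidy"
```

The same awk one-liner can be put at the end of the pipeline in place of the sed call.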

A change in requirements

The OP mentioned in the comments that the final solution should also drop any line where the 4th and 5th columns of file1 both match the 4th and 5th columns of file2.

For example, add this line to both file1 & file2:

s2/40   40      .       S       S       90      N=2     F=5;U=4

A single-line addition to the original solution, placed just before the print statement, handles this change in the requirements:

if ((k in a) && (lc==$4) && (ld==$5)) next

New Example

ex2.awk:

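# First file (file1): index every line by columns 1 and 2.
FNR==NR{
  k=$1" "$2
  a[k]=$4" "$5   # marks the key as seen in file1
  b[k]=$0        # the full file1 line
  c[k]=$4
  d[k]=$5
  next
}

# Second file (file2)
{ k=$1" "$2
  lc=c[k]
  ld=d[k]
  # new: skip lines whose columns 4 and 5 are identical in both files
  if ((k in a) && (lc==$4) && (ld==$5)) next
  # && binds tighter than ||, so the two alternatives are grouped explicitly
  if ((k in a) && ($4==$5) && ((lc==$4) || (ld==$5))) print b[k]" "$0
}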

Rerunning with the new awk script, ex2.awk, the added s2/40 line is dropped and the output is unchanged:

$ awk -f ex2.awk file1 file2 | sed 's/[ ]\+/  /g'
s2/90  60  .  C  G  30  N=2  F=5;U=4  s2/90  60  .  G  G  97  N=2  F=5;U=4
s2/80  20  .  A  T  86  N=2  F=5;U=4  s2/80  20  .  A  A  20  N=2  F=5;U=4
s2/20  10  .  G  T  90  N=2  F=5;U=4  s2/20  10  .  G  G  99  N=2  F=5;U=4
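
If you want to check the changed requirement end to end, here is a rough, self-contained sketch. It writes the sample data (plus the extra s2/40 line) to a temporary directory and runs an inline awk program that follows the same logic as ex2.awk. The array names (`seen`, `c4`, `c5`) are my own, the condition is grouped with explicit parentheses (an assumption about the intended logic), and only the matching key is printed to keep the output short:

```shell
#!/bin/sh
# Work in a throwaway directory so no existing files are overwritten.
tmp=$(mktemp -d)

cat > "$tmp/file1" <<'EOF'
s2/80   20      .       A       T       86      N=2     F=5;U=4
s2/20   10      .       G       T       90      N=2     F=5;U=4
s2/90   60      .       C       G       30      N=2     F=5;U=4
s2/40   40      .       S       S       90      N=2     F=5;U=4
EOF

cat > "$tmp/file2" <<'EOF'
s2/90   60      .       G       G       97      N=2     F=5;U=4
s2/80   20      .       A       A       20      N=2     F=5;U=4
s2/15   11      .       A       A       22      N=2     F=5;U=4
s2/90   21      .       C       C       82      N=2     F=5;U=4
s2/20   10      .       G       G       99      N=2     F=5;U=4
s2/80   10      .       T       G       11      N=2     F=5;U=4
s2/90   60      .       G       T       55      N=2     F=5;U=4
s2/40   40      .       S       S       90      N=2     F=5;U=4
EOF

result=$(awk '
# file1: remember columns 4 and 5 for each (column 1, column 2) key
FNR == NR { k = $1 " " $2; seen[k] = 1; c4[k] = $4; c5[k] = $5; next }
# file2: apply the skip rule first, then the match rule
{
    k = $1 " " $2
    if (!(k in seen)) next                     # key not present in file1
    if (c4[k] == $4 && c5[k] == $5) next       # identical columns 4 and 5: drop
    if ($4 == $5 && (c4[k] == $4 || c5[k] == $5)) print k
}' "$tmp/file1" "$tmp/file2")

echo "$result"
rm -rf "$tmp"
```

The s2/40 key should not show up in the result, while the three keys from the earlier output should.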