Bash – Comparing Files based on 5 fields using Awk and Bash

awk, bash

I want to compare File1 and File2 (both separated by spaces) using five fields (columns 1, 2, 4, 5 and 6).

Logic: If columns 1 and 2 of File1 and File2 match exactly, and File2 has the same character as any of the characters present in columns 4 and 5 of File1, then those lines of File1 and File2 are concatenated and written to the output.

File1:

s2/80   20      .       A       T       86      N=2     F=5;U=4
s2/20   10      .       G       T       90      N=2     F=5;U=4
s2/90   60      .       C       G       30      N=2     F=5;U=4

File2:

s2/90   60      .       G       G       97      N=2     F=5;U=4
s2/80   20      .       A       A       20      N=2     F=5;U=4
s2/15   11      .       A       A       22      N=2     F=5;U=4
s2/90   21      .       C       C       82      N=2     F=5;U=4
s2/20   10      .       G       G       99      N=2     F=5;U=4
s2/80   10      .       T       G       11      N=2     F=5;U=4
s2/90   60      .       G       T       55      N=2     F=5;U=4

Output:

s2/80  20 . A   T   86  N=2 F=5;U=4  s2/80  20  . A   A   20   N=2     F=5;U=4
s2/20  10 . G   T   90  N=2 F=5;U=4  s2/20  10  . G   G   99   N=2     F=5;U=4
s2/90  60 . C   G   30  N=2 F=5;U=4  s2/90  60  . G   G   97   N=2     F=5;U=4

I'm new in this field and would appreciate any guidance.

Best Answer

Here's a solution using just awk. Put the below code in a file called ex.awk:

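# First file (file1): index each line by columns 1 and 2.
FNR==NR{
    k=$1" "$2
    a[k]=$4" "$5   # marks the key as seen in file1
    b[k]=$0        # full line from file1
    c[k]=$4        # column 4 from file1
    d[k]=$5        # column 5 from file1
    next
}

# Second file (file2).
{ k=$1" "$2
  lc=c[k]
  ld=d[k]
  # Print file1's line followed by file2's line when the key exists in file1,
  # file2's columns 4 and 5 are equal, and they match column 4 or 5 of file1.
  # The inner parentheses matter because && binds tighter than ||.
  if ((k in a) && ($4==$5) && ((lc==$4) || (ld==$5))) print b[k]" "$0
}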

And then run it like this with the above 2 files:

$ awk -f ex.awk file1 file2

Example

The sed is just to format the output for StackExchange!

$ awk -f ex.awk file1 file2 | sed 's/[ ]\+/  /g'
s2/90  60  .  C  G  30  N=2  F=5;U=4  s2/90  60  .  G  G  97  N=2  F=5;U=4
s2/80  20  .  A  T  86  N=2  F=5;U=4  s2/80  20  .  A  A  20  N=2  F=5;U=4
s2/20  10  .  G  T  90  N=2  F=5;U=4  s2/20  10  .  G  G  99  N=2  F=5;U=4
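When a condition mixes && and ||, as the one in ex.awk does, it helps to remember that awk gives && higher precedence than ||. An expression written as A && B || C is therefore evaluated as (A && B) || C, which means C alone can make the whole test true. A minimal sketch of the difference, using the placeholder variables A, B and C (these are illustrative and do not come from the script above):

```shell
# A=0, B=1, C=1: the two groupings give different answers.
ungrouped=$(awk 'BEGIN { A = 0; B = 1; C = 1; print ((A && B || C) ? "true" : "false") }')
grouped=$(awk 'BEGIN { A = 0; B = 1; C = 1; print ((A && (B || C)) ? "true" : "false") }')

echo "A && B || C   -> $ungrouped"
echo "A && (B || C) -> $grouped"
```

With the sample data in the question both forms happen to select the same lines, but on other input an ungrouped trailing || test could fire for a line whose key was never seen in file1.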

A change in requirements

The OP mentioned in the comments below that he'd like the ultimate solution to drop any lines where the 4th and 5th columns from file1 matched the 4th and 5th columns from file2.

For example, add this line to both file1 & file2:

s2/40   40      .       S       S       90      N=2     F=5;U=4

A single line addition to the original solution can address this particular change in the requirements.

if ((k in a) && (lc==$4) && (ld==$5)) next

New Example

ex2.awk:

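# First file (file1): index each line by columns 1 and 2.
FNR==NR{
  k=$1" "$2
  a[k]=$4" "$5
  b[k]=$0
  c[k]=$4
  d[k]=$5
  next
}

# Second file (file2).
{ k=$1" "$2
  lc=c[k]
  ld=d[k]
  # New rule: skip lines whose columns 4 and 5 are identical in both files.
  if ((k in a) && (lc==$4) && (ld==$5)) next
  # Original rule, with the || alternatives grouped explicitly.
  if ((k in a) && ($4==$5) && ((lc==$4) || (ld==$5))) print b[k]" "$0
}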

Rerunning the new awk script, ex2.awk:

$ awk -f ex2.awk file1 file2 | sed 's/[ ]\+/  /g'
s2/90  60  .  C  G  30  N=2  F=5;U=4  s2/90  60  .  G  G  97  N=2  F=5;U=4
s2/80  20  .  A  T  86  N=2  F=5;U=4  s2/80  20  .  A  A  20  N=2  F=5;U=4
s2/20  10  .  G  T  90  N=2  F=5;U=4  s2/20  10  .  G  G  99  N=2  F=5;U=4
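To try this without creating the files by hand, here is a self-contained sketch that writes the sample data (including the extra S S line) to a temporary directory and applies the same rules inline. It assumes a POSIX shell with mktemp available; the array names line, c4 and c5 and the variable result are only illustrative, and the conditions are written as separate steps rather than copied from ex2.awk.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
s2/80 20 . A T 86 N=2 F=5;U=4
s2/20 10 . G T 90 N=2 F=5;U=4
s2/90 60 . C G 30 N=2 F=5;U=4
s2/40 40 . S S 90 N=2 F=5;U=4
EOF
cat > "$tmp/file2" <<'EOF'
s2/90 60 . G G 97 N=2 F=5;U=4
s2/80 20 . A A 20 N=2 F=5;U=4
s2/15 11 . A A 22 N=2 F=5;U=4
s2/90 21 . C C 82 N=2 F=5;U=4
s2/20 10 . G G 99 N=2 F=5;U=4
s2/80 10 . T G 11 N=2 F=5;U=4
s2/90 60 . G T 55 N=2 F=5;U=4
s2/40 40 . S S 90 N=2 F=5;U=4
EOF

result=$(awk '
FNR == NR { line[$1 " " $2] = $0; c4[$1 " " $2] = $4; c5[$1 " " $2] = $5; next }
{
    k = $1 " " $2
    if (!(k in line)) next                  # columns 1 and 2 must match
    if (c4[k] == $4 && c5[k] == $5) next    # identical columns 4 and 5: dropped
    if ($4 == $5 && (c4[k] == $4 || c5[k] == $5)) print line[k], $0
}' "$tmp/file1" "$tmp/file2")

echo "$result"
rm -rf "$tmp"
```

This should print the same three pairs as above (s2/90 60, s2/80 20 and s2/20 10, in file2 order), with the s2/40 40 line dropped.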

Related Solutions

Unix – Comparing Two Files Using Awk

You can use awk. Put the following in a script, script.awk:

FNR == NR {
  f1[$1,$2,$4] = $0
  f1_c14[$1,$2,$4] = 1
  f1_c5[$1,$2,$4] = $5
  next
}  

f1_c14[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}

f1[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print $0;
}

Now run it like this:

$ awk -f script.awk file1  file2
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4

The script works as follows. The first block, repeated below, creates three arrays: f1, f1_c14, and f1_c5. f1 holds every line of file1, indexed by the contents of columns 1, 2, and 4. f1_c14 uses the same index and stores the value 1. f1_c5 also uses the same index and stores the value of column 5 from file1.

FNR == NR {
  f1[$1,$2,$4] = $0
  f1_c14[$1,$2,$4] = 1
  f1_c5[$1,$2,$4] = $5
  next
}
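A subscript written with commas, such as f1[$1,$2,$4], is really a single string: awk joins the parts with the built-in SUBSEP separator. A small sketch to illustrate, assuming one made-up input line and the hypothetical names seen and manual:

```shell
result=$(printf 'sc2/80 20 . A T\n' | awk '
{
    seen[$1, $2, $4] = $5            # stored under the key $1 SUBSEP $2 SUBSEP $4
    manual = $1 SUBSEP $2 SUBSEP $4  # the same key, built by hand
    if (manual in seen)       print "manual key found: " seen[manual]
    if (($1, $2, $4) in seen) print "tuple key found: " seen[$1, $2, $4]
}')

echo "$result"
```

Both tests succeed, which is why a key built explicitly with SUBSEP and the comma form can be used interchangeably.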

The next block prints a line from the first file, file1, when its columns 1, 2, and 4 match those of the current line of file2, and it only prints that line if the 5th columns of file1 and file2 do not match.

f1_c14[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}

The 3rd block prints the associated line from file2 when there is a corresponding entry in the array f1 for file2's columns 1, 2, and 4. Again, it only prints if the 5th columns do not match.

f1[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print $0;
}

Example

Running the above script like so:

$ awk -f script.awk file1  file2
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4

You can use the column command to clean up the output slightly:

$ awk -f script.awk file1  file2 | column -t
sc2/80  20  .  A  T  86  PASS  N=2  F=5;U=4
sc2/80  20  .  A  C  80  PASS  N=2  F=5;U=4
sc2/60  55  .  G  T  76  PASS  N=2  F=5;U=4
sc2/60  55  .  G  C  72  PASS  N=2  F=5;U=4

How it works?

FNR == NR

This makes use of the way awk reads its input files one after another. NR counts lines across all files, while FNR restarts at 1 for each file, so FNR == NR is true only while awk is reading the first file, file1. That lets us run a particular block of code only on lines from file1.

This example shows what FNR == NR is doing when we give it 2 simulated files. One has 4 lines in it while the other has 5 lines:

$ awk 'BEGIN {print "NR\tFNR\tline"} {print NR"\t"FNR"\t"$0}' \
     <(seq 1 4) <(seq 1 5)
NR  FNR line
1   1   1
2   2   2
3   3   3
4   4   4
5   1   1
6   2   2
7   3   3
8   4   4
9   5   5
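One caveat with this idiom: if the first file is empty, NR and FNR are still equal while awk reads the second file, so the "first file" block runs on the wrong input. Comparing FILENAME with ARGV[1] is one way to avoid that. A rough sketch, assuming mktemp is available and using throwaway file names (empty, data) that are not part of the original example:

```shell
tmp=$(mktemp -d)
: > "$tmp/empty"                 # first file has no lines
printf 'x\ny\n' > "$tmp/data"    # second file has two lines

# With an empty first file, NR == FNR holds for the second file's lines.
by_nr=$(awk 'NR == FNR { n++ } END { print n + 0 }' "$tmp/empty" "$tmp/data")

# Comparing FILENAME with ARGV[1] identifies the first file explicitly.
by_name=$(awk 'FILENAME == ARGV[1] { n++ } END { print n + 0 }' "$tmp/empty" "$tmp/data")

echo "NR == FNR matched $by_nr lines; FILENAME == ARGV[1] matched $by_name lines"
rm -rf "$tmp"
```

Here the NR == FNR test matches both lines of the second file, while the FILENAME test matches none, which is the intended behaviour for an empty first file.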

other blocks

The other blocks, f1_c14[$1,$2,$4] and f1[$1,$2,$4], only run when the corresponding array element has a value.
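Using an array element as a pattern tests its value, not whether the key exists: an element holding 0 or an empty string counts as false, and simply referencing a missing element creates it. The `in` operator tests for existence without that side effect. A minimal sketch, with the made-up array arr and keys zero, line and missing:

```shell
result=$(awk 'BEGIN {
    arr["zero"] = 0                      # stored, but 0 is false as a pattern
    arr["line"] = "some text"            # a non-empty string is true

    print "zero: in=" (("zero" in arr) ? 1 : 0) " truthy=" (arr["zero"] ? 1 : 0)
    print "line: in=" (("line" in arr) ? 1 : 0) " truthy=" (arr["line"] ? 1 : 0)

    # Referencing a missing element creates it with an empty value.
    before = ("missing" in arr) ? 1 : 0
    unused = arr["missing"]
    after = ("missing" in arr) ? 1 : 0
    print "missing: before=" before " after=" after
}')

echo "$result"
```

For the script above this makes no practical difference, since f1_c14 always stores 1 and f1 stores a non-empty line, but it is worth keeping in mind when the stored values can be 0 or empty.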

Matching Five Columns in two Files using Awk

awk '
    {
        key = $1 SUBSEP $2 SUBSEP $4
    }
    # here, we are reading file1
    NR == FNR {
        f1_line[key] = $0 
        next
    }
    # here, we are reading file2
    key in f1_line && ($5 == "." || $5 == $4) {
        print f1_line[key], $0
    }
' file1 file2

outputs

s2/80   20      .       A       T       86      F=5;U=4 s2/80   20      .       A       A       20      F=5;U=4
s2/20   10      .       G       T       90      F=5;U=4 s2/20   10      .       G       .       99      F=5;U=4
