Unix – Comparing Two Files Using Awk

awkbashsed

I need to compare two files, File1 and File2 (Separated by space) using 4 fields (Field 1, 2, 4 and 5 of File 1 with field1, 2, 4 and 5 of File2).

Logic:
If column 1, 2 and 4 of File 1 matches with column 1, 2 and 4 of File 2 and there is a mismatch at column 5 then both the lines from File 1 and File 2 are concatenated redirected as output.
Therefore, the output file only contains those lines where Column 1, 2 and 4 of File1 and File2 matches and Column 5 does not match.

File1:

sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/68         20      .        T       C         71       PASS     N=2     F=5;U=4
sc2/24         24      .        T       G         31       PASS     N=2     F=5;U=4

File2:

sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4 
sc2/68         20      .        T        C        71      PASS    N=2       F=5;U=4 
sc2/10         24      .        T        G        31      PASS    N=2       F=5;U=4
sc2/40         59      .        T        G        31      PASS    N=2       F=5;U=4
sc2/24         24      .        A        G        38      PASS    N=2       F=5;U=4

Output:

sc2/80         20      .        A       T        86      PASS     N=2      F=5;U=4
sc2/80         20      .        A       C        80      PASS     N=2      F=5;U=4

sc2/60         55      .        G       T        76      PASS     N=2      F=5;U=4 
sc2/60         55      .        G       C        72      PASS     N=2      F=5;U=4

I am new in the field and I appreciate your help.

Best Answer

You can use awk. Put the following in a script, script.awk:

FNR == NR {
  f1[$1,$2,$4] = $0
  f1_c14[$1,$2,$4] = 1
  f1_c5[$1,$2,$4] = $5
  next
}  

f1_c14[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}

f1[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print $0;
}

Now run it like this:

$ awk -f script.awk file1  file2
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4

The script works as follows. This block creates 3 arrays, f1, f1_c14, and f1_c5. f1 contains all the lines of file1 in an array, indexed using the contents of the columns 1, 2, & 4 from file1. f1_c14 is another array with the same index (1, 2, & 4's contents) and a value of 1. The 3rd array uses the same index as the 1st 2, with the value of the 5th column from file1.

FNR == NR {
  f1[$1,$2,$4] = $0
  f1_c14[$1,$2,$4] = 1
  f1_c5[$1,$2,$4] = $5
  next
} 

The next block is responsible for printing lines from the 1st file, file1 under the conditions that the columns 1, 2, & 4 match the columns from file2, AND it will onlu print the line from file1 if the 5th columns of file1 and file2 do not match.

f1_c14[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}

The 3rd block is responsible for printing the associated line from file2 there's a corresponding line in the array f1 for file2's columns 1, 2, & 4. Again it only prints if the 5th columns do not match.

f1[$1,$2,$4] {
  if ($5 != f1_c5[$1,$2,$4]) print $0;
}

Example

Running the above script like so:

$ awk -f script.awk file1  file2
sc2/80         20      .        A       T         86       PASS     N=2     F=5;U=4
sc2/80         20      .        A        C        80      PASS    N=2       F=5;U=4
sc2/60         55      .        G       T         76       PASS     N=2     F=5;U=4 
sc2/60         55      .        G        C        72      PASS    N=2       F=5;U=4 

You can use the column command to clean up the output slightly:

$ awk -f script.awk file1  file2 | column -t
sc2/80  20  .  A  T  86  PASS  N=2  F=5;U=4
sc2/80  20  .  A  C  80  PASS  N=2  F=5;U=4
sc2/60  55  .  G  T  76  PASS  N=2  F=5;U=4
sc2/60  55  .  G  C  72  PASS  N=2  F=5;U=4

How it works?

FNR == NR

This makes use of awk's ability to loop through files in a particular way. Here's we're looping through the files and when we're on a line that's from the first file, file, we want to run a particular block of code on this line from file1.

This example shows what FNR == NR is doing when we give it 2 simulated files. One has 4 lines in it while the other has 5 lines:

$ awk 'BEGIN {print "NR\tFNR\tline"} {print NR"\t"FNR"\t"$0}' \
     <(seq 1 4) <(seq 1 5)
NR  FNR line
1   1   1
2   2   2
3   3   3
4   4   4
5   1   1
6   2   2
7   3   3
8   4   4
9   5   5
other blocks

The other blocks, f1_c14[$1,$2,$4] AND f1[$1,$2,$4] only run when the values from those array elements has a value.

Related Question