I need to compare two files, File1 and File2 (Separated by space) using 4 fields (Field 1, 2, 4 and 5 of File 1 with field1, 2, 4 and 5 of File2).
Logic:
If column 1, 2 and 4 of File 1 matches with column 1, 2 and 4 of File 2 and there is a mismatch at column 5 then both the lines from File 1 and File 2 are concatenated redirected as output.
Therefore, the output file only contains those lines where Column 1, 2 and 4 of File1 and File2 matches and Column 5 does not match.
File1:
sc2/80 20 . A T 86 PASS N=2 F=5;U=4
sc2/60 55 . G T 76 PASS N=2 F=5;U=4
sc2/68 20 . T C 71 PASS N=2 F=5;U=4
sc2/24 24 . T G 31 PASS N=2 F=5;U=4
File2:
sc2/80 20 . A C 80 PASS N=2 F=5;U=4
sc2/60 55 . G C 72 PASS N=2 F=5;U=4
sc2/68 20 . T C 71 PASS N=2 F=5;U=4
sc2/10 24 . T G 31 PASS N=2 F=5;U=4
sc2/40 59 . T G 31 PASS N=2 F=5;U=4
sc2/24 24 . A G 38 PASS N=2 F=5;U=4
Output:
sc2/80 20 . A T 86 PASS N=2 F=5;U=4
sc2/80 20 . A C 80 PASS N=2 F=5;U=4
sc2/60 55 . G T 76 PASS N=2 F=5;U=4
sc2/60 55 . G C 72 PASS N=2 F=5;U=4
I am new in the field and I appreciate your help.
Best Answer
You can use
awk
. Put the following in a script,script.awk
:Now run it like this:
The script works as follows. This block creates 3 arrays,
f1
,f1_c14
, andf1_c5
.f1
contains all the lines of file1 in an array, indexed using the contents of the columns 1, 2, & 4 from file1.f1_c14
is another array with the same index (1, 2, & 4's contents) and a value of1
. The 3rd array uses the same index as the 1st 2, with the value of the 5th column from file1.The next block is responsible for printing lines from the 1st file,
file1
under the conditions that the columns 1, 2, & 4 match the columns fromfile2
, AND it will onlu print the line fromfile1
if the 5th columns offile1
andfile2
do not match.The 3rd block is responsible for printing the associated line from
file2
there's a corresponding line in the arrayf1
forfile2
's columns 1, 2, & 4. Again it only prints if the 5th columns do not match.Example
Running the above script like so:
You can use the
column
command to clean up the output slightly:How it works?
FNR == NRThis makes use of
awk
's ability to loop through files in a particular way. Here's we're looping through the files and when we're on a line that's from the first file,file
, we want to run a particular block of code on this line fromfile1
.This example shows what
other blocksFNR == NR
is doing when we give it 2 simulated files. One has 4 lines in it while the other has 5 lines:The other blocks,
f1_c14[$1,$2,$4]
ANDf1[$1,$2,$4]
only run when the values from those array elements has a value.