You can use awk
. Put the following in a script, script.awk
:
FNR == NR {
f1[$1,$2,$4] = $0
f1_c14[$1,$2,$4] = 1
f1_c5[$1,$2,$4] = $5
next
}
f1_c14[$1,$2,$4] {
if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}
f1[$1,$2,$4] {
if ($5 != f1_c5[$1,$2,$4]) print $0;
}
Now run it like this:
$ awk -f script.awk file1 file2
sc2/80 20 . A T 86 PASS N=2 F=5;U=4
sc2/80 20 . A C 80 PASS N=2 F=5;U=4
sc2/60 55 . G T 76 PASS N=2 F=5;U=4
sc2/60 55 . G C 72 PASS N=2 F=5;U=4
The script works as follows. This block creates 3 arrays, f1
, f1_c14
, and f1_c5
. f1
contains all the lines of file1 in an array, indexed using the contents of the columns 1, 2, & 4 from file1. f1_c14
is another array with the same index (1, 2, & 4's contents) and a value of 1
. The 3rd array uses the same index as the 1st 2, with the value of the 5th column from file1.
FNR == NR {
f1[$1,$2,$4] = $0
f1_c14[$1,$2,$4] = 1
f1_c5[$1,$2,$4] = $5
next
}
The next block is responsible for printing lines from the 1st file, file1
under the conditions that the columns 1, 2, & 4 match the columns from file2
, AND it will onlu print the line from file1
if the 5th columns of file1
and file2
do not match.
f1_c14[$1,$2,$4] {
if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}
The 3rd block is responsible for printing the associated line from file2
there's a corresponding line in the array f1
for file2
's columns 1, 2, & 4. Again it only prints if the 5th columns do not match.
f1[$1,$2,$4] {
if ($5 != f1_c5[$1,$2,$4]) print $0;
}
Example
Running the above script like so:
$ awk -f script.awk file1 file2
sc2/80 20 . A T 86 PASS N=2 F=5;U=4
sc2/80 20 . A C 80 PASS N=2 F=5;U=4
sc2/60 55 . G T 76 PASS N=2 F=5;U=4
sc2/60 55 . G C 72 PASS N=2 F=5;U=4
You can use the column
command to clean up the output slightly:
$ awk -f script.awk file1 file2 | column -t
sc2/80 20 . A T 86 PASS N=2 F=5;U=4
sc2/80 20 . A C 80 PASS N=2 F=5;U=4
sc2/60 55 . G T 76 PASS N=2 F=5;U=4
sc2/60 55 . G C 72 PASS N=2 F=5;U=4
How it works?
FNR == NR
This makes use of awk
's ability to loop through files in a particular way. Here's we're looping through the files and when we're on a line that's from the first file, file
, we want to run a particular block of code on this line from file1
.
This example shows what FNR == NR
is doing when we give it 2 simulated files. One has 4 lines in it while the other has 5 lines:
$ awk 'BEGIN {print "NR\tFNR\tline"} {print NR"\t"FNR"\t"$0}' \
<(seq 1 4) <(seq 1 5)
NR FNR line
1 1 1
2 2 2
3 3 3
4 4 4
5 1 1
6 2 2
7 3 3
8 4 4
9 5 5
other blocks
The other blocks, f1_c14[$1,$2,$4]
AND f1[$1,$2,$4]
only run when the values from those array elements has a value.
Best Answer
You can try: