You can use awk. Put the following in a script, script.awk:
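# First file (file1): remember each line, keyed on columns 1, 2 & 4.
FNR == NR {
    f1[$1,$2,$4] = $0
    f1_c14[$1,$2,$4] = 1
    f1_c5[$1,$2,$4] = $5
    next
}
# Second file (file2): if the key was seen in file1 and column 5 differs,
# print file1's line...
f1_c14[$1,$2,$4] {
    if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}
# ...followed by the matching line from file2.
f1[$1,$2,$4] {
    if ($5 != f1_c5[$1,$2,$4]) print $0;
}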
Now run it like this:
$ awk -f script.awk file1 file2
sc2/80 20 . A T 86 PASS N=2 F=5;U=4
sc2/80 20 . A C 80 PASS N=2 F=5;U=4
sc2/60 55 . G T 76 PASS N=2 F=5;U=4
sc2/60 55 . G C 72 PASS N=2 F=5;U=4
The script works as follows. The first block creates 3 arrays: f1, f1_c14, and f1_c5. f1 contains all the lines of file1, indexed using the contents of columns 1, 2, & 4 from file1. f1_c14 is another array with the same index (the contents of columns 1, 2, & 4) and a value of 1. The 3rd array, f1_c5, uses the same index as the first two, with the value of the 5th column from file1.
FNR == NR {
    f1[$1,$2,$4] = $0
    f1_c14[$1,$2,$4] = 1
    f1_c5[$1,$2,$4] = $5
    next
}
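To see how a subscript such as [$1,$2,$4] behaves, here is a minimal sketch. awk joins the three values into a single string key using the built-in SUBSEP separator, so the key can be split apart again. The input line is a hypothetical one modelled on the sample output in this answer, and the names keys, k, and part are only for illustration.

```shell
# Hypothetical one-line input, modelled on the sample output shown in this answer.
line='sc2/80 20 . A T 86 PASS N=2 F=5;U=4'

keys=$(printf '%s\n' "$line" | awk '
{
    f1[$1,$2,$4] = $0               # stored under the key $1 SUBSEP $2 SUBSEP $4
}
END {
    for (k in f1) {
        n = split(k, part, SUBSEP)  # recover the three key fields
        print n, part[1], part[2], part[3]
    }
}')
echo "$keys"
```

With this input the key splits back into 3 parts: sc2/80, 20, and A.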
The next block is responsible for printing lines from the 1st file, file1, under the condition that columns 1, 2, & 4 match the same columns from file2, AND it will only print the line from file1 if the 5th columns of file1 and file2 do not match.
f1_c14[$1,$2,$4] {
    if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}
The 3rd block is responsible for printing the associated line from file2 when there's a corresponding line in the array f1 for file2's columns 1, 2, & 4. Again, it only prints if the 5th columns do not match.
f1[$1,$2,$4] {
    if ($5 != f1_c5[$1,$2,$4]) print $0;
}
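As a possible variant (a sketch, not the script above), the two printing blocks can be folded into one by testing key membership with awk's in operator, which also makes the separate f1_c14 flag array unnecessary. The file contents below are assumptions: four of the lines are taken from the sample output in this answer, and the sc2/70 lines are made up to show a pair that is skipped because column 5 matches. The array names line and col5 are hypothetical.

```shell
tmp=$(mktemp -d)

# Assumed sample inputs; the real file1 and file2 are not shown in this answer.
cat > "$tmp/file1" <<'EOF'
sc2/80 20 . A T 86 PASS N=2 F=5;U=4
sc2/60 55 . G T 76 PASS N=2 F=5;U=4
sc2/70 30 . C G 90 PASS N=2 F=5;U=4
EOF
cat > "$tmp/file2" <<'EOF'
sc2/80 20 . A C 80 PASS N=2 F=5;U=4
sc2/60 55 . G C 72 PASS N=2 F=5;U=4
sc2/70 30 . C G 88 PASS N=2 F=5;U=4
EOF

out=$(awk '
FNR == NR {                         # first file: store line and column 5 by key
    line[$1,$2,$4] = $0
    col5[$1,$2,$4] = $5
    next
}
# second file: key known from the first file, but column 5 differs
(($1,$2,$4) in line) && $5 != col5[$1,$2,$4] {
    print line[$1,$2,$4]            # line from the first file
    print $0                        # matching line from the second file
}' "$tmp/file1" "$tmp/file2")

echo "$out"
rm -r "$tmp"
```

With these assumed inputs the sketch prints the sc2/80 and sc2/60 pairs and leaves out the sc2/70 pair. Unlike a plain lookup such as line[$1,$2,$4], the in test does not create an empty element for keys that only appear in the second file.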
Example
Running the above script like so:
$ awk -f script.awk file1 file2
sc2/80 20 . A T 86 PASS N=2 F=5;U=4
sc2/80 20 . A C 80 PASS N=2 F=5;U=4
sc2/60 55 . G T 76 PASS N=2 F=5;U=4
sc2/60 55 . G C 72 PASS N=2 F=5;U=4
You can use the column command to clean up the output slightly:
$ awk -f script.awk file1 file2 | column -t
sc2/80  20  .  A  T  86  PASS  N=2  F=5;U=4
sc2/80  20  .  A  C  80  PASS  N=2  F=5;U=4
sc2/60  55  .  G  T  76  PASS  N=2  F=5;U=4
sc2/60  55  .  G  C  72  PASS  N=2  F=5;U=4
How does it work?
FNR == NR
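This makes use of awk's ability to loop through files in a particular way. FNR is the record (line) number within the current file, while NR counts records across all the files read so far, so the two are only equal while awk is reading the first file. Here we're looping through the files, and when we're on a line from the first file, file1, we want to run a particular block of code on that line.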
This example shows what FNR == NR is doing when we give it 2 simulated files. One has 4 lines in it while the other has 5 lines:
$ awk 'BEGIN {print "NR\tFNR\tline"} {print NR"\t"FNR"\t"$0}' \
<(seq 1 4) <(seq 1 5)
NR FNR line
1 1 1
2 2 2
3 3 3
4 4 4
5 1 1
6 2 2
7 3 3
8 4 4
9 5 5
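Other blocks
The other blocks, f1_c14[$1,$2,$4] and f1[$1,$2,$4], use an array element as their pattern, so they only run when that element holds a non-empty, non-zero value, i.e. when the line's columns 1, 2, & 4 were also seen in file1.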
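A small sketch of that behaviour, using a hypothetical array called seen: an element that was assigned 1 counts as true, an element that was never assigned counts as false, and simply looking up the unassigned element creates it as an empty entry.

```shell
result=$(awk 'BEGIN {
    seen["a"] = 1
    if (seen["a"])   print "a: set"
    if (!seen["b"])  print "b: empty"    # this lookup creates seen["b"]
    if ("b" in seen) print "b: now exists as an empty element"
}')
echo "$result"
```

All three messages are printed. The last one shows the side effect: patterns like f1[$1,$2,$4] quietly add empty elements for keys that only occur in file2, which is harmless here but worth knowing for large inputs.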
Best Answer
Here's a solution using just awk. Put the below code in a file called ex.awk:
And then run it like this with the above 2 files:
Example
The sed is just to format the output for StackExchange!
A change in requirements
The OP mentioned in the comments below that he'd like the ultimate solution to drop any lines where the 4th and 5th columns from file1 matched the 4th and 5th columns from file2.
For example, add this line to both file1 & file2:
A single line addition to the original solution can address this particular change in the requirements.
New Example
ex2.awk:
Rerunning the new awk script, ex2.awk: