You can use awk. Put the following in a script, script.awk:
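# First file (file1): remember each line, keyed on columns 1, 2 and 4.
FNR == NR {
    f1[$1,$2,$4] = $0        # the whole line
    f1_c14[$1,$2,$4] = 1     # flag: this key exists in file1
    f1_c5[$1,$2,$4] = $5     # column 5, for the later comparison
    next
}
# Second file (file2): print the stored file1 line if column 5 differs.
f1_c14[$1,$2,$4] {
    if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}
# Then print the file2 line itself under the same condition.
f1[$1,$2,$4] {
    if ($5 != f1_c5[$1,$2,$4]) print $0;
}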
Now run it like this:
$ awk -f script.awk file1 file2
sc2/80 20 . A T 86 PASS N=2 F=5;U=4
sc2/80 20 . A C 80 PASS N=2 F=5;U=4
sc2/60 55 . G T 76 PASS N=2 F=5;U=4
sc2/60 55 . G C 72 PASS N=2 F=5;U=4
The script works as follows. The first block creates 3 arrays: f1, f1_c14, and f1_c5. f1 holds every line of file1, indexed by the contents of columns 1, 2, & 4 of that line. f1_c14 is another array with the same index (1, 2, & 4's contents) and a value of 1. The 3rd array, f1_c5, uses the same index as the first 2 and holds the value of the 5th column from file1.
FNR == NR {
    f1[$1,$2,$4] = $0
    f1_c14[$1,$2,$4] = 1
    f1_c5[$1,$2,$4] = $5
    next
}
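A side note on the index: awk has no true multi-dimensional arrays, so a subscript such as [$1,$2,$4] is stored as a single string with the fields joined by the built-in SUBSEP character. The sketch below uses a made-up input line (not the real file1) and a hypothetical array name, seen, to show the key being built and then split apart again.

```shell
# Hypothetical one-line input; the real file1 is not reproduced here.
out=$(printf 'sc2/80 20 . A T 86\n' | awk '
{
    seen[$1,$2,$4] = $5                 # key is $1 SUBSEP $2 SUBSEP $4
}
END {
    for (key in seen) {
        n = split(key, part, SUBSEP)    # recover the three key fields
        print n, part[1], part[2], part[3], seen[key]
    }
}')
printf '%s\n' "$out"    # prints: 3 sc2/80 20 A T
```

Because the key is just a string, two lines only share an entry when all three columns match exactly.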
The next block is responsible for printing lines from the 1st file, file1. It runs when columns 1, 2, & 4 of a line in file2 match those of a line in file1, AND it will only print the line from file1 if the 5th columns of file1 and file2 do not match.
f1_c14[$1,$2,$4] {
    if ($5 != f1_c5[$1,$2,$4]) print f1[$1,$2,$4];
}
The 3rd block is responsible for printing the associated line from file2 when there's a corresponding line in the array f1 for file2's columns 1, 2, & 4. Again it only prints if the 5th columns do not match.
f1[$1,$2,$4] {
    if ($5 != f1_c5[$1,$2,$4]) print $0;
}
Example
Running the above script like so:
$ awk -f script.awk file1 file2
sc2/80 20 . A T 86 PASS N=2 F=5;U=4
sc2/80 20 . A C 80 PASS N=2 F=5;U=4
sc2/60 55 . G T 76 PASS N=2 F=5;U=4
sc2/60 55 . G C 72 PASS N=2 F=5;U=4
You can use the column command to clean up the output slightly:
$ awk -f script.awk file1 file2 | column -t
sc2/80  20  .  A  T  86  PASS  N=2  F=5;U=4
sc2/80  20  .  A  C  80  PASS  N=2  F=5;U=4
sc2/60  55  .  G  T  76  PASS  N=2  F=5;U=4
sc2/60  55  .  G  C  72  PASS  N=2  F=5;U=4
How does it work?
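FNR == NR
This makes use of awk's ability to loop through files in a particular way. NR counts every line read so far across all the files, while FNR restarts at 1 for each new file, so the two are equal only while awk is reading the first file. Here we're looping through the files, and when we're on a line that's from the first file, file1, we want to run a particular block of code on that line.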
This example shows what FNR == NR is doing when we give it 2 simulated files. One has 4 lines in it while the other has 5 lines:
$ awk 'BEGIN {print "NR\tFNR\tline"} {print NR"\t"FNR"\t"$0}' \
<(seq 1 4) <(seq 1 5)
NR FNR line
1 1 1
2 2 2
3 3 3
4 4 4
5 1 1
6 2 2
7 3 3
8 4 4
9 5 5
Other blocks
The other blocks, f1_c14[$1,$2,$4] and f1[$1,$2,$4], only run when the corresponding array element has a value, i.e. when the key built from the current file2 line was also seen in file1.
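To make that concrete: a bare array reference used as a pattern is true when the element holds a non-empty string or a non-zero number. Here is a minimal sketch with made-up keys and a hypothetical array called known, not the real data:

```shell
out=$(printf 'a\nb\nc\n' | awk '
BEGIN { known["a"] = 1; known["c"] = "some line" }
known[$1] { print "match:", $1 }    # body runs only for keys that hold a value
')
printf '%s\n' "$out"
# prints:
# match: a
# match: c
```

Note that merely referencing known["b"] creates an empty element for it. That is harmless here, but a test such as ($1 in known) avoids the side effect.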
Here's a solution using just awk. Put the below code in a file called ex.awk:
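BEGIN{}
# First file: index each line by columns 1 and 2.
FNR==NR{
    k=$1" "$2
    a[k]=$4" "$5
    b[k]=$0
    c[k]=$4
    d[k]=$5
    next
}
# Second file: look up the file1 line that has the same key.
{   k=$1" "$2
    lc=c[k]
    ld=d[k]
    # lc, ld come from file1; $4, $5 come from the current file2 line
    if ((k in a) && ($4==$5) && (lc==$4) || (ld==$5)) print b[k]" "$0
}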
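One detail worth checking in the if condition: awk gives && higher precedence than ||, so a && b && c || d is evaluated as (a && b && c) || d. The sketch below uses made-up 0/1 values (the variable names a, b, c, d are placeholders, not the real columns) to show the difference that explicit parentheses make.

```shell
out=$(awk 'BEGIN {
    a = 0; b = 1; c = 1; d = 1
    # No extra parentheses: grouped as (a && b && c) || d
    if (a && b && c || d)
        print "default grouping: true"
    else
        print "default grouping: false"
    # Explicit parentheses around the || part
    if (a && b && (c || d))
        print "explicit grouping: true"
    else
        print "explicit grouping: false"
}')
printf '%s\n' "$out"
# prints:
# default grouping: true
# explicit grouping: false
```

If the intent is that the key must exist in file1 in every case, wrapping the two column comparisons in their own parentheses would be the way to say so. Whether that matches the requirement depends on the data.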
And then run it like this with the above 2 files:
$ awk -f ex.awk file1 file2
Example
The sed is just to format the output for StackExchange!
$ awk -f ex.awk file1 file2 | sed 's/[ ]\+/ /g'
s2/90 60 . C G 30 N=2 F=5;U=4 s2/90 60 . G G 97 N=2 F=5;U=4
s2/80 20 . A T 86 N=2 F=5;U=4 s2/80 20 . A A 20 N=2 F=5;U=4
s2/20 10 . G T 90 N=2 F=5;U=4 s2/20 10 . G G 99 N=2 F=5;U=4
A change in requirements
The OP mentioned in the comments below that he'd like the ultimate solution to drop any lines where the 4th and 5th columns from file1 matched the 4th and 5th columns from file2.
For example, add this line to both file1 & file2:
s2/40 40 . S S 90 N=2 F=5;U=4
A single line addition to the original solution can address this particular change in the requirements.
if ((k in a) && (lc==$4) && (ld==$5)) next
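That line relies on next, which stops processing the current record and moves straight on to the next one, so the print that follows it is never reached for a dropped line. A minimal sketch with made-up numeric input:

```shell
out=$(printf '1\n2\n3\n' | awk '
$1 == 2 { next }        # skip this record; later rules do not run for it
{ print "kept", $1 }
')
printf '%s\n' "$out"
# prints:
# kept 1
# kept 3
```

The order matters: the next test has to come before the rule that prints.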
New Example
ex2.awk:
BEGIN{}
FNR==NR{
    k=$1" "$2
    a[k]=$4" "$5
    b[k]=$0
    c[k]=$4
    d[k]=$5
    next
}
{   k=$1" "$2
    lc=c[k]
    ld=d[k]
    if ((k in a) && (lc==$4) && (ld==$5)) next
    if ((k in a) && ($4==$5) && (lc==$4) || (ld==$5)) print b[k]" "$0
}
Rerunning the new awk script, ex2.awk:
$ awk -f ex2.awk file1 file2 | sed 's/[ ]\+/ /g'
s2/90 60 . C G 30 N=2 F=5;U=4 s2/90 60 . G G 97 N=2 F=5;U=4
s2/80 20 . A T 86 N=2 F=5;U=4 s2/80 20 . A A 20 N=2 F=5;U=4
s2/20 10 . G T 90 N=2 F=5;U=4 s2/20 10 . G G 99 N=2 F=5;U=4
Best Answer
If the files are sorted, the task is much easier to do with diff.