I need some help with a combination of awk
& while
loop.
I have two simple files with columns (normal ones are very large), one representing simple intervals for an ID=10(of coding regions(exons),for chromosome 10 here):
#exons.bed
10 60005 60100
10 61007 61130
10 61200 61300
10 61500 61650
10 61680 61850
and the other representing sequenced reads(=just intervals again but smaller) with an other value as last column, that I ll need later:
#reads.bed
10 60005 60010 34
10 61010 61020 40
10 61030 61040 22
10 61065 61070 35
10 61100 61105 41
So, I would like to search in a quick and efficient way and find which read intervals (of which line in the file) and how many, fall in one coding region:
exon 1(first interval of table 1) contains reads of line 1,2,3, etc.
of reads.file(2nd table)
so that I can get the value of 4th column of these lines later, for each exon.
I 've written a code,that probably needs some corrections on the while loop, since I cannot make it parse the reads lines one by one for each awk. Here it is:
while read chr a b cov; do #for the 4-column file
#if <a..b> interval of read falls inside exon interval:
awk '($2<=$a && $b <= $3) {print NR}' exons.bed >> out_lines.bed
done < reads.bed
At the moment I can make the awk line running when I give manually a,b, but I want to make it run automatically for each a,b pair by file.
Any suggestion on changing syntax, or way of doing it, is highly appreciated!
FOLLOW UP
Finally I worked it out with this code:
awk 'NR==FNR{
a[NR]=$2;
b[NR]=$3;
next; }
{ #second file
s[i]=0; m[i]=0; k[i]=0; # Add sum and mean calculation
for (i in a){
if($2>=a[i] && $3<=b[i]){ # 2,3: cols of second file here
k[i]+=1
print k #Count nb of reads found in
out[i]=out[i]" "FNR # keep Nb of Line of read
rc[i]=rc[i]" "FNR"|"$4 #keep Line and cov value of $4th col
s[i]= s[i]+$4 #sum over coverages for each exon
m[i]= s[i]/k[i] #Calculate mean (k will be the No or
#reads found on i-th exon)
}}
}
END{
for (i in out){
print "Exon", i,": Reads with their COV:",rc[i],\
"Sum=",s[i],"Mean=",m[i] >> "MeanCalc.txt"
}}' exons.bed reads.bed
OUTPUT:
Exon 2 : Reads with their COV: 2|40 3|22 4|35 5|41 Sum= 138 Mean= 34.5
etc.
Best Answer
The first issue is that you can't use bash variables inside
awk
like that.$a
withinawk
evaluates to fielda
buta
is empty since it is not defined inawk
, but inbash
. One way around this is to useawk
's-v
option to define the variableSo, you could do:
You have another mistake there though. In order for a read to fall within an exon, the read's start position must be greater than the start position of the exon and its end position smaller than the end position of the exon. You are using
$2<=a && b <= $3
which will select reads whose start is outside the exon's boundaries. What you want is$2>=a && $3<=b
.In any case, running this type of thing in a bash loop is very inefficient since it needs to read the input file once for every pair of
a
andb
. Why not do the whole thing inawk
?The script above produces the following output if run on your example files:
Here's the same thing in a less condensed form for clarity