Shell – Nested ‘awk’ in a ‘while’ loop, parse two files line by line and compare column values

awkbioinformaticsshell-scripttext processing

I need some help with a combination of awk & while loop.
I have two simple files with columns (normal ones are very large), one representing simple intervals for an ID=10(of coding regions(exons),for chromosome 10 here):

#exons.bed
10  60005   60100   
10  61007   61130   
10  61200   61300   
10  61500   61650   
10  61680   61850

and the other representing sequenced reads(=just intervals again but smaller) with an other value as last column, that I ll need later:

#reads.bed
10  60005   60010    34 
10  61010   61020    40
10  61030   61040    22
10  61065   61070    35 
10  61100   61105    41

So, I would like to search in a quick and efficient way and find which read intervals (of which line in the file) and how many, fall in one coding region:

exon 1(first interval of table 1) contains reads of line 1,2,3, etc. 
of   reads.file(2nd table)

so that I can get the value of 4th column of these lines later, for each exon.

I 've written a code,that probably needs some corrections on the while loop, since I cannot make it parse the reads lines one by one for each awk. Here it is:

while read chr a b cov; do  #for the 4-column file

#if <a..b> interval of read falls inside exon interval:
awk '($2<=$a && $b <= $3) {print NR}' exons.bed >> out_lines.bed

done < reads.bed

At the moment I can make the awk line running when I give manually a,b, but I want to make it run automatically for each a,b pair by file.

Any suggestion on changing syntax, or way of doing it, is highly appreciated!

FOLLOW UP

Finally I worked it out with this code:

    awk 'NR==FNR{
        a[NR]=$2; 
        b[NR]=$3;
        next; }
    {  #second file
    s[i]=0; m[i]=0;  k[i]=0;              # Add sum and mean calculation
    for (i in a){                                            
       if($2>=a[i] && $3<=b[i]){         # 2,3: cols of second file here
          k[i]+=1
          print k                      #Count nb of reads found in
          out[i]=out[i]" "FNR          # keep Nb of Line of read 
          rc[i]=rc[i]" "FNR"|"$4       #keep Line and cov value of $4th col
          s[i]= s[i]+$4                #sum over coverages for each exon
          m[i]= s[i]/k[i]             #Calculate mean (k will be the No or  
                                       #reads found on i-th exon)
     }}  
    }
    END{
       for (i in out){
          print "Exon", i,": Reads with their COV:",rc[i],\
          "Sum=",s[i],"Mean=",m[i] >> "MeanCalc.txt"

    }}' exons.bed  reads.bed

OUTPUT:

   Exon 2 : Reads with their COV:  2|40 3|22 4|35 5|41 Sum= 138  Mean= 34.5
   etc.

Best Answer

The first issue is that you can't use bash variables inside awk like that. $a within awk evaluates to field a but a is empty since it is not defined in awk, but in bash. One way around this is to use awk's -v option to define the variable

-v var=val
--assign var=val
   Assign the value val to the variable var,  before  execution  of
   the  program  begins.  Such variable values are available to the
   BEGIN rule of an AWK program.

So, you could do:

while read chr a b cov; do 
  awk -v a="$a" -v b="$b" '($2<=a && b <= $3) {print NR}' exons.bed > out$a$b 
done < reads.bed

You have another mistake there though. In order for a read to fall within an exon, the read's start position must be greater than the start position of the exon and its end position smaller than the end position of the exon. You are using $2<=a && b <= $3 which will select reads whose start is outside the exon's boundaries. What you want is $2>=a && $3<=b.

In any case, running this type of thing in a bash loop is very inefficient since it needs to read the input file once for every pair of a and b. Why not do the whole thing in awk?

awk 'NR==FNR{a[NR]=$2;b[NR]=$3; next} {
        for (i in a){
           if($2>=a[i] && $3<=b[i]){
            out[i]=out[i]" "FNR 
        }}}
        END{for (i in out){
                   print "Exon",i,"contains reads of line(s)"out[i],\
                   "of reads file" 
        }}' exons.bed reads.bed

The script above produces the following output if run on your example files:

Exon 1 contains reads of line(s) 1 of reads file
Exon 2 contains reads of line(s) 2 3 4 5 of reads file

Here's the same thing in a less condensed form for clarity

#!/usr/bin/awk -f

## While we're reading the 1st file, exons.bed
NR==FNR{
    ## Save the start position in array a and the end 
    ## in array b. The keys of the arrays are the line numbers.
    a[NR]=$2;
    b[NR]=$3; 
    ## Move to the next line, without continuing
    ## the script.
    next;
}
 ## Once we move on to the 2nd file, reads.bed
 {
     ## For each set of start and end positions
     for (i in a){
         ## If the current line's 2nd field is greater than
         ## this start position and smaller than this end position,
         ## add this line number (FNR is the current file's line number)
         ## to the list of reads for the current value of i. 
         if($2>=a[i] && $3<=b[i]){
             out[i]=out[i]" "FNR 
         }
     }
 }
 ## After both files have been processed
 END{
     ## For each exon in the out array
     for (i in out){
         ## Print the exon name and the redas it contains
         print "Exon",i,"contains reads of line(s)"out[i],
             "of reads file" 
        }

Related Solutions

Shell – Loop over a file and read values from two columns into variables

Then that should only be a matter of:

while read discard x y discard; do
  sed "s/A/$x/g;s/B/$y/g" < daily.template > "daily.$x-$y.inp"
done < data.dat

The read command will read one line at a time from data.dat, put the first field (fields being blank separated, and where backslash can be used to escape the field or line separator) in the variable $discard, the second in $x, the third in $y and the rest of the fields in $discard.

Compare two files based on the values in their last column

$ cat fileA fileB | sort -k3,3 -k4,4nr | sort -k3,3 -u
a b c 10
d e f 15
g h i 15
j k l 20
p l m 35
w x z 40

This is a pipeline with three parts:

Concatenate fileA with fileB.
Sort the concatenated file in decreasing numerical order based on the fourth column for each unique value in the third column. The result of this step is
```
a b c 10
d e f 15
d e f 11
g h i 15
j k l 20
j k l 15
p l m 35
w x z 40
```
Sort this again, but remove duplicates and only use the third column as the sort key. Since this will leave the line with the first found instance of the sort key but discard the lines with duplicated sort keys (and with lower values in the fourth column, thanks to the first sort), it will give us the wanted result.

This approach disregards the contents of the first two columns completely.

Best Answer

Related Solutions

Shell – Loop over a file and read values from two columns into variables

Compare two files based on the values in their last column

Related Question