Bash – counting multiple patterns in a single pass with grep

bashbioinformaticsgrepshell-scripttext processing

I've written a grep loop to iteratively count DNA trinucleotides within a gzipped DNA fasta file containing DNA sequences e.g.

declare -a tri=(AAA AAC AAG AAT CAA .. etc)

for i in ${tri[@]}
do
   gzip -cd gencode.v18.pc_transcripts.fa.gz | grep -v "^>" | grep -o $i | wc -l
done

Where the fasta file is in this format (though much much bigger)

head test.fa
>id1
TTTTTAAAAA
>id2
GGGGGCCCCC
etc..

Whilst this works (i.e. counts occurrences of each trinucleotide) it is to my mind quite inefficient as it has to pass through the data 64 times (once for each possible trinucleotide).

My question is how using bash or grep is there a way I can count each trinucleotide in a single pass through the file (as the files are quite large)?

thx

Best Answer

IFS=$'\n'
gzip -dc file.gz | grep -v '^>' | grep -Foe "${tri[*]}" | sort | uniq -c

But by the way, AAAC matches both AAA and AAC, but grep -o will output only one of them. Is that what you want? Also, how many occurrences of AAA in AAAAAA? 2 or 4 ([AAA]AAA, A[AAA]AA, AA[AAA]A, AAA[AAA])?

Maybe you want instead:

gzip -dc file.gz | grep -v '^>' | fold -w3 | grep -Fxe "${tri[*]}" | sort | uniq -c

That is split the lines in groups of 3 characters and count the occurrences as full lines (would find 0 occurrence of AAA in ACAAATTCG (as that's ACA AAT TCG)).

Or on the other hand:

gzip -dc file.gz | awk '
  BEGIN{n=ARGC;ARGC=0}
  !/^>/ {l = length - 2; for (i = 1; i <= l; i++) a[substr($0,i,3)]++}
  END{for (i=1;i<n;i++) printf "%s: %d\n", ARGV[i], a[ARGV[i]]}' "${tri[@]}"

(would find 4 occurrences of AAA in AAAAAA).

Related Question