I've written a grep loop to iteratively count DNA trinucleotides within a gzipped DNA fasta file containing DNA sequences e.g.
declare -a tri=(AAA AAC AAG AAT CAA .. etc)
for i in ${tri[@]}
do
gzip -cd gencode.v18.pc_transcripts.fa.gz | grep -v "^>" | grep -o $i | wc -l
done
Where the fasta file is in this format (though much much bigger)
head test.fa
>id1
TTTTTAAAAA
>id2
GGGGGCCCCC
etc..
Whilst this works (i.e. counts occurrences of each trinucleotide) it is to my mind quite inefficient as it has to pass through the data 64 times (once for each possible trinucleotide).
My question is how using bash
or grep
is there a way I can count each trinucleotide in a single pass through the file (as the files are quite large)?
thx
Best Answer
But by the way,
AAAC
matches bothAAA
andAAC
, butgrep -o
will output only one of them. Is that what you want? Also, how many occurrences ofAAA
inAAAAAA
? 2 or 4 ([AAA]AAA
,A[AAA]AA
,AA[AAA]A
,AAA[AAA]
)?Maybe you want instead:
That is split the lines in groups of 3 characters and count the occurrences as full lines (would find 0 occurrence of
AAA
inACAAATTCG
(as that'sACA AAT TCG
)).Or on the other hand:
(would find 4 occurrences of
AAA
inAAAAAA
).