Bash – counting multiple patterns in a single pass with grep

bash, bioinformatics, grep, shell-script, text processing

I've written a grep loop that counts DNA trinucleotides in a gzipped FASTA file of DNA sequences, one trinucleotide per iteration, e.g.:

declare -a tri=(AAA AAC AAG AAT CAA .. etc)

for i in "${tri[@]}"
do
   gzip -cd gencode.v18.pc_transcripts.fa.gz | grep -v "^>" | grep -o "$i" | wc -l
done

The FASTA file is in this format (though much, much bigger):

head test.fa
>id1
TTTTTAAAAA
>id2
GGGGGCCCCC
etc..

Whilst this works (i.e. it counts the occurrences of each trinucleotide), it seems quite inefficient to me, as it has to pass through the data 64 times (once for each possible trinucleotide).

My question: using bash or grep, is there a way to count every trinucleotide in a single pass through the file (as the files are quite large)?

thx

Best Answer

IFS=$'\n'
gzip -dc file.gz | grep -v '^>' | grep -Foe "${tri[*]}" | sort | uniq -c

By the way, AAAC contains both AAA and AAC, but grep -o will output only one of them. Is that what you want? Also, how many occurrences of AAA are there in AAAAAA: 2 or 4 ([AAA]AAA, A[AAA]AA, AA[AAA]A, AAA[AAA])?
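To make those two questions concrete, here is a minimal sketch that runs grep -o on the two strings mentioned above. The input is made up for illustration and is not taken from the real file.

```shell
# Toy inputs only: the two strings discussed above.
# AAAC contains AAA (at offset 0) and AAC (at offset 1).
overlap=$(printf 'AAAC\n' | grep -oF -e AAA -e AAC)

# AAAAAA contains AAA at four offsets, but grep -o resumes after each match.
repeat=$(printf 'AAAAAA\n' | grep -oF -e AAA | wc -l | tr -d ' ')

printf 'AAAC   -> %s\n' "$overlap"
printf 'AAAAAA -> %s match(es) of AAA\n' "$repeat"
```

This should report only AAA for the first string and 2 matches for the second: grep -o takes the leftmost match and then continues scanning after its end, so overlapping occurrences are never reported.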

Maybe you want instead:

gzip -dc file.gz | grep -v '^>' | fold -w3 | grep -Fxe "${tri[*]}" | sort | uniq -c

That is, split each line into groups of 3 characters and count the groups that match a trinucleotide as a full line. This would find 0 occurrences of AAA in ACAAATTCG (as that's ACA AAT TCG).

Or on the other hand:

gzip -dc file.gz | awk '
  BEGIN{n=ARGC;ARGC=0}
  !/^>/ {l = length - 2; for (i = 1; i <= l; i++) a[substr($0,i,3)]++}
  END{for (i=1;i<n;i++) printf "%s: %d\n", ARGV[i], a[ARGV[i]]}' "${tri[@]}"

(This would find 4 occurrences of AAA in AAAAAA.)
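The three counting rules can give three different answers on the same input. The sketch below compares them on a single made-up record (GAAAAC), counting only AAA to keep it short. The helper seqs and the variable names are hypothetical, and the awk line is a simplified single-pattern version of the script above rather than a replacement for it.

```shell
# Made-up toy record; in the real pipeline this would come from `gzip -dc file.gz`.
fasta=$(printf '>id1\nGAAAAC\n')

# Hypothetical helper: emit the sequence lines only (drop the ">" headers).
seqs() { printf '%s\n' "$fasta" | grep -v '^>'; }

# Non-overlapping, any offset: grep -o resumes scanning after each match.
n_grep=$(seqs | grep -oF AAA | wc -l | tr -d ' ')

# Fixed frame: cut the line into GAA and AAC, then match whole lines only.
n_fold=$(seqs | fold -w3 | grep -Fx AAA | wc -l | tr -d ' ')

# Overlapping: slide a 3-character window one position at a time.
n_slide=$(seqs | awk '
  { for (i = 1; i <= length($0) - 2; i++) if (substr($0, i, 3) == "AAA") n++ }
  END { print n + 0 }')

echo "grep -o: $n_grep, fold: $n_fold, sliding window: $n_slide"
# prints: grep -o: 1, fold: 0, sliding window: 2
```

Which number is "right" depends on whether you are counting codons in a fixed reading frame (fold), every position where the trinucleotide starts (sliding window), or something in between (grep -o).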

Related Solutions

Grep – How to Run Grep with Multiple AND Patterns

agrep can do it with this syntax:

agrep 'pattern1;pattern2'

With GNU grep, when built with PCRE support, you can do:

grep -P '^(?=.*pattern1)(?=.*pattern2)'

With ast grep:

grep -X '.*pattern1.*&.*pattern2.*'

(The .*s are needed because <x>&<y> matches strings that match both <x> and <y> exactly; a&b would never match, as no string can be both a and b at the same time.)

If the patterns don't overlap, you may also be able to do:

grep -e 'pattern1.*pattern2' -e 'pattern2.*pattern1'

The best portable way is probably with awk as already mentioned:

awk '/pattern1/ && /pattern2/'

With sed:

sed -e '/pattern1/!d' -e '/pattern2/!d'

Beware that these tools all use different regular expression syntaxes.
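As a quick check that the portable variants agree, here is a small sketch on made-up lines, with foo and bar standing in for pattern1 and pattern2 (the sample text and variable names are assumptions for illustration):

```shell
# Made-up sample lines; "foo" and "bar" are placeholder patterns.
sample=$(printf 'foo then bar\nbar then foo\nonly foo\nonly bar\n')

# awk: print lines matching both patterns, in any order.
with_awk=$(printf '%s\n' "$sample" | awk '/foo/ && /bar/')

# sed: delete every line that lacks either pattern.
with_sed=$(printf '%s\n' "$sample" | sed -e '/foo/!d' -e '/bar/!d')

# grep: spell out both orders (only safe when the patterns cannot overlap).
with_grep=$(printf '%s\n' "$sample" | grep -e 'foo.*bar' -e 'bar.*foo')

printf '%s\n' "$with_awk"
```

All three variables should hold the same two lines, "foo then bar" and "bar then foo".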

Grep – Find Multiple AND Patterns in Any Order

If your version of grep supports PCRE (GNU grep does this with the -P or --perl-regexp option), you can use lookaheads to match multiple words in any order:

grep -P '(?=.*?word1)(?=.*?word2)(?=.*?word3)^.*$'

This won't highlight the words, though. Lookaheads are zero-length assertions; they're not part of the matched text.

I think your piping solution should work for that. By default, grep only colors the output when it's going to a terminal, so only the last command in the pipeline does highlighting, but you can override this with --color=always.

grep --color=always foo | grep --color=always bar
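A small sketch of the chained form on made-up lines, with foo and bar as placeholder patterns. It assumes a grep that accepts --color=always (GNU and BSD grep do). One caveat: the second grep sees the escape sequences that the first one inserted, so a second pattern that can match part of an escape sequence (digits, "[" or "m", for instance) may match in unexpected places.

```shell
# Made-up sample lines; "foo" and "bar" are placeholder patterns.
sample=$(printf 'foo then bar\nbar then foo\nonly foo\nonly bar\n')

# Each grep keeps only lines containing its word, so the chain acts as AND.
# --color=always forces highlighting even though stdout is not a terminal.
colored=$(printf '%s\n' "$sample" | grep --color=always foo | grep --color=always bar)

# The same filter without colour, for comparison.
plain=$(printf '%s\n' "$sample" | grep foo | grep bar)

n_colored=$(printf '%s\n' "$colored" | wc -l | tr -d ' ')
printf '%s\n' "$colored"
```

Both pipelines should keep the same two lines; the coloured one additionally carries the escape sequences that highlight foo and bar.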
