Awk Grep – Count Number of Substring Repetitions in a String


I have a file which contains a gene sequence such as:

ATGTGGATGGTGGGTTACAATGAAGGTGGTGAGTTCAACATGGCTGATTATCCATTCAGTGGAAGGAAACTAAGGCCTCTCATTCCAAGACCAGTCCCAGTCCCTACTACTTCTCCTAACAGCACTTCAACTATAACTCCTTCCTTAAACCGCATTCATGGTGGCAATGATTTATTTTCACAATATCATCACAATCTGCAGCAGCAAGCATCAGTAGGAGATCATAGCAAGAGATCAGAGTTGAATAATAATAATAATCCATCTGCAGCAGTTGTGGTGAGTTCAAGATGGAATCCAACACCAGAACAGTTAAGAGCACTGGAAGAATTGTATAGAAGAGGAACAAGAACACCTTCTGCTGAGCAAATCCAACAAATAACTGCCCAGCTTAGAAAATTTGGAAAAATTGAAGGCAAAAATGTTTTCTATTGGTTTCAGAATCACAAAGCCAGAGAAAGGCAAAAACGACGGCGTCAAATGGAATCAGCAGCTGCTGAGTTTGATTCTGCTATTGAAAAGAAAGACTTAGGCGCAAGTAGG


ACAGTGTTTGAAGTTGAACACACTAAAAACTGGCTACCATCTACAAATTCCAGTACCAGTACTCTTCATCTTGCAGAGGAATCTGTTTCAATTCAAAGGTCAGCAGCAGCAAAAGCAGATGGATGGCTCCAATTCGATGAAGCAGAATTACAGCAAAGAAGAAACTTTATGGAAAGGAATGCCACGTGGCATATGATGCAGTTAACTTCTTCTTGTCCTACAGCTAGCATGTCCACCACAACCACAGTAACAACTAGACTTATGGACCCAAAACTCATCAAGACCCATGAACTCAACTTATTCATTTCACCTCACACATACAAAGAAAGAGAAAACGCTTTTATCCACTTAAATACTAGTAGTACTCATCAAAATGAATCTGATCAAACCCTTCAACTTTTCCCAATAAGGAATGGAGATCATGGATGCACTGATCATCATCATCATCATCATAACATTATCAAAGAGACACAGATATCAGCTTCAGCAATCAATGCACCCAACCAGTTTATTGAGTTTCTTCCCTTGAAAAACTGA

I am trying to count the number of occurrences of the substring "ATG" in the string above (each sequence is a single line with no line breaks). My file contains tens of these sequences, and I want to count how many times "ATG" appears in each one. Each sequence is separated from the others by an empty line.

I tried grep but could not work out which options to use (if grep can solve this problem at all), and I searched for an awk example but did not find one.

Best Answer

This prints the number of occurrences of ATG on each non-empty line:

awk -F'ATG' 'NF{print NF-1}' testfile

This works for files with one or many lines.

Example 1

Consider this test file:

$ cat testfile
xxATGxxATG

ATGxxxATGxxx

xxATGxxxxATGxxATGxx

The code correctly counts the occurrences of ATG:

$ awk -F'ATG' 'NF{print NF-1}' testfile
2
2
3

Example 2

Using the example in the current version of the question:

$ cat >file1
ATGTGGATGGTGGGTTACAATGAAGGTGGTGAGTTCAACATGGCTGATTATCCATTCAGTGGAAGGAAACTAAGGCCTCTCATTCCAAGACCAGTCCCAGTCCCTACTACTTCTCCTAACAGCACTTCAACTATAACTCCTTCCTTAAACCGCATTCATGGTGGCAATGATTTATTTTCACAATATCATCACAATCTGCAGCAGCAAGCATCAGTAGGAGATCATAGCAAGAGATCAGAGTTGAATAATAATAATAATCCATCTGCAGCAGTTGTGGTGAGTTCAAGATGGAATCCAACACCAGAACAGTTAAGAGCACTGGAAGAATTGTATAGAAGAGGAACAAGAACACCTTCTGCTGAGCAAATCCAACAAATAACTGCCCAGCTTAGAAAATTTGGAAAAATTGAAGGCAAAAATGTTTTCTATTGGTTTCAGAATCACAAAGCCAGAGAAAGGCAAAAACGACGGCGTCAAATGGAATCAGCAGCTGCTGAGTTTGATTCTGCTATTGAAAAGAAAGACTTAGGCGCAAGTAGG


ACAGTGTTTGAAGTTGAACACACTAAAAACTGGCTACCATCTACAAATTCCAGTACCAGTACTCTTCATCTTGCAGAGGAATCTGTTTCAATTCAAAGGTCAGCAGCAGCAAAAGCAGATGGATGGCTCCAATTCGATGAAGCAGAATTACAGCAAAGAAGAAACTTTATGGAAAGGAATGCCACGTGGCATATGATGCAGTTAACTTCTTCTTGTCCTACAGCTAGCATGTCCACCACAACCACAGTAACAACTAGACTTATGGACCCAAAACTCATCAAGACCCATGAACTCAACTTATTCATTTCACCTCACACATACAAAGAAAGAGAAAACGCTTTTATCCACTTAAATACTAGTAGTACTCATCAAAATGAATCTGATCAAACCCTTCAACTTTTCCCAATAAGGAATGGAGATCATGGATGCACTGATCATCATCATCATCATCATAACATTATCAAAGAGACACAGATATCAGCTTCAGCAATCAATGCACCCAACCAGTTTATTGAGTTTCTTCCCTTGAAAAACTGA

This results in:

$ awk -F'ATG' 'NF{print NF-1}' file1
9
15
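As a cross-check on the field-splitting approach, the same counts can be obtained from awk's gsub function, which returns the number of substitutions it made. This is a minimal sketch and not part of the original answer; it recreates the Example 1 test file in a temporary location, and the variable names are illustrative.

```shell
# Recreate the Example 1 test file in a temporary location.
testfile=$(mktemp)
printf 'xxATGxxATG\n\nATGxxxATGxxx\n\nxxATGxxxxATGxxATGxx\n' > "$testfile"

# gsub() returns how many matches it replaced; replacing each match with
# itself ("&") leaves the line unchanged and yields the count.
# The NF pattern skips empty lines, as in the original command.
counts=$(awk 'NF { print gsub(/ATG/, "&") }' "$testfile")

printf '%s\n' "$counts"
rm -f "$testfile"
```

On this input the sketch prints 2, 2 and 3, matching the output of the -F'ATG' version.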

How it works

awk implicitly loops through every line of a file. Each line is divided into fields.

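-F'ATG'

This tells awk to use ATG as the field separator.

NF{print NF-1}

For each non-empty line, this tells awk to print the number of fields minus 1. A line containing n occurrences of ATG is split into n+1 fields, so NF-1 is the number of occurrences.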

(On empty lines, the number of fields, NF, is zero. So, the condition NF evaluates to false on these lines, effectively skipping over them.)
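The relationship between fields and separators can be seen on a single line. The sketch below uses the first line of the Example 1 test file; the variable names are hypothetical and only there for illustration.

```shell
# First line of the Example 1 test file ("xx" is filler, not sequence data).
line='xxATGxxATG'

# With ATG as the separator the line splits into "xx", "xx" and an empty
# trailing field, so NF is 3 and NF-1 is the number of separators seen.
fields=$(printf '%s\n' "$line" | awk -F'ATG' '{ print NF }')
count=$(printf '%s\n' "$line" | awk -F'ATG' 'NF { print NF-1 }')

echo "fields=$fields count=$count"
```

This prints fields=3 count=2. The empty field after the final ATG still counts toward NF, which is why the subtraction gives the right answer even when the line ends with the separator.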

Related Solutions

AWK for filling up rest of the columns with &nbsp in file

The easiest way is to do it in two steps. First, find the largest number of fields on any line:

max=$(awk 'max < NF { max = NF } END { print max }' infile)

Then use that as input when filling out the other columns:

awk -v max="$max" '{ for(i=NF+1; i<=max; i++) $i = "N/A"; print }' infile
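Here is a small end-to-end run of the two steps, as a sketch under assumptions: the input is a hypothetical whitespace-separated file with ragged rows, created in a temporary location. Rows that get padded are rebuilt with awk's default output separator (a single space), so a tab-delimited file would need -F and OFS set accordingly.

```shell
# Hypothetical input: rows with 3, 1 and 2 whitespace-separated fields.
infile=$(mktemp)
printf 'a b c\nd\ne f\n' > "$infile"

# Step 1: largest field count on any line.
max=$(awk 'max < NF { max = NF } END { print max }' "$infile")

# Step 2: pad shorter rows with "N/A" up to that count.
padded=$(awk -v max="$max" '{ for (i = NF + 1; i <= max; i++) $i = "N/A"; print }' "$infile")

printf '%s\n' "$padded"
rm -f "$infile"
```

For this input, max is 3 and the padded rows are "a b c", "d N/A N/A" and "e f N/A".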

awk grep – Extract Lines Containing Terms from One File to Another Using Grep/Awk

To extract the lines from data.txt with the genes listed in genelist.txt:

grep -w -F -f genelist.txt data.txt > newdata.txt

grep options used:

-w tells grep to match whole words only (i.e. so ABC123 won't also match ABC1234).
-F search for fixed strings (plain text) rather than regular expressions
-f genelist.txt read search patterns from the file
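To see the effect of -w, here is a minimal sketch using hypothetical tab-separated data; the gene names and values are invented for illustration and are not from the original question. It assumes a grep that supports -w, as GNU and BSD grep do.

```shell
# Hypothetical inputs in a scratch directory.
workdir=$(mktemp -d)
printf 'ABC123\nXYZ9\n' > "$workdir/genelist.txt"
printf 'ABC123\t1\t2\nABC1234\t3\t4\nXYZ9\t5\t6\nDEF7\t7\t8\n' > "$workdir/data.txt"

# -F: patterns are fixed strings; -w: whole-word matches only;
# -f: read one pattern per line from genelist.txt.
matched=$(grep -w -F -f "$workdir/genelist.txt" "$workdir/data.txt")

printf '%s\n' "$matched"
rm -rf "$workdir"
```

Only the ABC123 and XYZ9 rows are printed. The ABC1234 row is skipped because the match for ABC123 is followed by another word character.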

If you want the header (Sample 1, Sample 2, etc) line as well:

grep -w -F -f genelist.txt -e Sample data.txt > newdata.txt

-e Sample also search for "Sample"

To find lines in genelist.txt that aren't in newdata.txt:

grep -v -w -F -f <(sed -E -e 's/(\t|  +).*//' newdata.txt) genelist.txt

-v invert the search, print non-matching lines.

The rest of the grep options are the same, but instead of passing a file to the -f option, this uses process substitution (a feature of shells such as bash, zsh, and ksh), which lets you use a command in place of an actual file. Whatever output the command produces is treated as the contents of that "file".

In this case, the command is sed -E -e 's/(\t|  +).*//' newdata.txt, which outputs each line of newdata.txt after deleting everything from the first TAB character or the first run of two or more spaces onward, whichever comes first. In other words, it outputs only the first field (e.g. "Gene A"). I had to use TAB or double space because a) I wasn't sure whether your data was space-separated or TAB-separated and b) the first fields in your example contained spaces.

sed options used:

-E use extended regular expressions, so we can use plain (, ), and + which are more readable than having to escape them with \ as \(, \), \+.
-e 's/(\t|  +).*//' specifies the sed script to apply to the input (newdata.txt)

Running that command on your sample data.txt would produce the following output:

$ sed -E -e 's/(\t|  +).*//' data.txt

Gene A
Gene B
Gene C
Gene D
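The sed step can also be tried in isolation. The sketch below is an illustration under assumptions: the two input rows are hypothetical, and a literal tab is passed through a shell variable instead of \t, since not every sed implementation interprets \t inside a regular expression.

```shell
# A literal tab character, so the script does not rely on sed understanding \t.
tab=$(printf '\t')

# Hypothetical rows: one tab-separated, one separated by double spaces.
first_fields=$(printf 'Gene A\t1\t2\nGene B  3  4\n' |
  sed -E -e "s/(${tab}|  +).*//")

printf '%s\n' "$first_fields"
```

Both rows are reduced to their first field ("Gene A" and "Gene B"); the single space inside each name is left alone because the pattern requires at least two consecutive spaces.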

Anyway, the output of that sed command is used as the list of search patterns by the grep command.