bash – How to Randomly Extract a Substring of 200 Characters from a Fasta File

bashbioinformaticsrandomtext processing

Is there any Linux command one can use to extract a sequence from a file? For instance, a file contains one million lines, and we want to randomly sample only a sequence of 200 characters from that file (without considering the header).

For random I mean that every 200 sequence gets the same probability to be chosen and none of the substrings chosen are repetitive.

I am thinking of extracting randomly this sequence of 200 characters from a fasta file like this (without considering the header):

>NC_001416.1 Enterobacteria phage lambda, complete genome
GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCGTTCTTCTTCG
TCATAACTTAATGTTTTTATTTAAAATACCCTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCGAGGC
TTTTTGGCCTCTGTCGTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAACAAAAAGCA
GCTGGCTGACATTTTCGGTGCGAGTATCCGTACCATTCAGAACTGGCAGGAACAGGGAATGCCCGTTCTG
CGAGGCGGTGGCAAGGGTAATGAGGTGCTTTATGACTCTGCCGCCGTCATAAAATGGTATGCCGAAAGGG
ATGCTGAAATTGAGAACGAAAAGCTGCGCCGGGAGGTTGAAGAACTGCGGCAGGCCAGCGAGGCAGATCT
CCAGCCAGGAACTATTGAGTACGAACGCCATCGACTTACGCGTGCGCAGGCCGACGCACAGGAACTGAAG
AATGCCAGAGACTCCGCTGAAGTGGTGGAAACCGCATTCTGTACTTTCGTGCTGTCGCGGATCGCAGGTG
AAATTGCCAGTATTCTCGACGGGCTCCCCCTGTCGGTGCAGCGGCGTTTTCCGGAACTGGAAAACCGACA
TGTTGATTTCCTGAAACGGGATATCATCAAAGCCATGAACAAAGCAGCCGCGCTGGATGAACTGATACCG
GGGTTGCTGAGTGAATATATCGAACAGTCAGGTTAACAGGCTGCGGCATTTTGTCCGCGCCGGGCTTCGC
TCACTGTTCAGGCCGGAGCCACAGACCGCCGTTGAATGGGCGGATGCTAATTACTATCTCCCGAAAGAAT
CCGCATACCAGGAAGGGCGCTGGGAAACACTGCCCTTTCAGCGGGCCATCATGAATGCGATGGGCAGCGA
CTACATCCGTGAGGTGAATGTGGTGAAGTCTGCCCGTGTCGGTTATTCCAAAATGCTGCTGGGTGTTTAT
GCCTACTTTATAGAGCATAAGCAGCGCAACACCCTTATCTGGTTGCCGACGGATGGTGATGCCGAGAACT
TTATGAAAACCCACGTTGAGCCGACTATTCGTGATATTCCGTCGCTGCTGGCGCTGGCCCCGTGGTATGG
CAAAAAGCACCGGGATAACACGCTCACCATGAAGCGTTTCACTAATGGGCGTGGCTTCTGGTGCCTGGGC
GGTAAAGCGGCAAAAAACTACCGTGAAAAGTCGGTGGATGTGGCGGGTTATGATGAACTTGCTGCTTTTG
ATGATGATATTGAACAGGAAGGCTCTCCGACGTTCCTGGGTGACAAGCGTATTGAAGGCTCGGTCTGGCC
AAAGTCCATCCGTGGCTCCACGCCAAAAGTGAGAGGCACCTGTCAGATTGAGCGTGCAGCCAGTGAATCC
CCGCATTTTATGCGTTTTCATGTTGCCTGCCCGCATTGCGGGGAGGAGCAGTATCTTAAATTTGGCGACA
AAGAGACGCCGTTTGGCCTCAAATGGACGCCGGATGACCCCTCCAGCGTGTTTTATCTCTGCGAGCATAA
TGCCTGCGTCATCCGCCAGCAGGAGCTGGACTTTACTGATGCCCGTTATATCTGCGAAAAGACCGGGATC

So I can obtain as an example a sequence subset like this:

GCATACCAGGAAGGGCGCTGGGAAACACTGCCCTTTCAGCGGGCCATCATGAATGCGATGGGCAGCGACTACATCCGTGAGGTGAATGTGGTGAAGTCTGCCCGTGTCGGTTATTCCAAAATGCTGCTGGGTGTTTATGCCTACTTTATAGAGCATAAGCAGCGCAACACCCTTATCTGGTTGCCGACGGATGGTGATGC

Best Answer

In order to be able to do multiple fast selections of random 200-character length sequences, it's convenient to save a copy of the fasta file without the newlines (excluding the header also).

< file.fasta tail -n+2 | tr -d '\n' > newfile

So you will be randomly selecting the start character without hitting any newline character and/or doing any calculations to deal with it. Also I assume that wc -c < file (or wc -m) and stat -c "%s" file give the same result (for usual content, locales etc, check it first), so we use stat which returns faster.

For a file with n characters, the available choices are n-200, we exclude from possible start position the last 200 characters because they can not form a 200-character long string.

shuf selects the random number for the range 1,n-200 and a combination of head and tail with -c will extract the string.

n=$(stat -c "%s" newfile)
r=$(shuf -i1-"$((n-200+1))" -n1)
< newfile tail -c+"$r" | head -c200

You could call it many times to get many different independent random selections, that means even the same or overlapping sequences.

If you want your selections to follow other criteria, like not at the same position in file and/or not overlapping, you would have to parse random numbers from the same shuf command (with higher -n value). Or, for not overlapping, discard any new values closer than 200 to the existing ones.

If you want to select x random independent but unique sequences you can start producing random lines, remove duplicates and keep x of them with head, for example to get 10 of them:

while true; do sh test.sh; printf "\n"; done | awk '!seen[$0]++' | head
Related Question