Bash – Random mutagenesis with bash

bashscripting

I have a string e.g.

1234567890

and I want to replace random positions of that string with corresponding position from a random sequence in another set of other strings e.g.

ABCDEFGHIJ
KLMNOPQRST
UVWXYZABCD
...

If I chose to make 3 replacements, the script should chose 3 random numbers e.g. 3,7,8; and 3 random sequences e.g. 1, 1, 3; make the replacements to generate the expected output:

12C456GB90

Is there a way to do this without significant looping? I wrote a simple bash script to generate a random position and a random sequence line then do 1 replacement, then repeat the process on the output, repeat, repeat. This works perfectly, however in my real-life files (much larger than the examples), I want to generate 10,000 or more replacements. Oh, and I will need to do this multiple times to generate multiple 'mutated' variant sequences.

EDIT: At the moment I am using something like this:

#chose random number between 1 and the number of characters in the string
randomposition=$(jot -r 1 1 $seqpositions)
#chose a random number between 1 and the number of lines in the set of potential replacement strings
randomline=$(jot -r 1 1 $alignlines)
#find the character at randomline:randomposition
newAA=$(sed -n "$randomline,$randomline p" $alignmentfile | cut -c$randomposition)
#replace the character at 'string:randomposition' with the character at 'randomline:randomposition'
sed "s/./$newAA/$randomposition" $sequencefile

(with some additional bits, obviously) and just looping through this thousands of times

Best Answer

Note:

This is strictly for amusement purposes; an equivalent program in C would be much simpler and orders of magnitude faster; as to bash, let's not even talk about ;-)

The following perl script will mutate a list of ~1M sequences, and ~10k alignments in about 10 seconds on my laptop.

#! /usr/bin/perl
# usage mutagen number_of_replacements alignment_file [ sequence_file ..]
use strict;
my $max = shift() - 1;
my $algf = shift;
open my $alg, $algf or die "open $algf: $!";
my @alg = <$alg>;

sub prand { map int(rand() * $_[0]), 0..$max }
while(<>){
    my @ip = prand length() - 1;
    my @op = prand scalar @alg;
    for my $i (0..$max){
        my $p = $ip[$i];
        substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;
    }
    print;
}

Usage example:

$ cat seq
1634870295
5684937021
2049163587
6598471230
$ cat alg
DPMBHZJEIO
INTMJZOYKQ
KNTXGLCJSR
GLJZRFVSEX
SYJVHEPNAZ
$ perl mutagen 3 alg seq
1L3V8702I5
5684HE7Y21
2049JZC587
6598H7C2E0

If the generated n random numbers have to be different between them, then prand should be changed to:

sub prand {
    my (@r, $m, %h);
    die "more replacements than positions/alignments" if $max >= $_[0];
    for(0..$max){
        my $r = int(rand() * $_[0]);
        $r = ($r + 1) % $_[0] while $h{$r};
        $h{$r} = 1;
        push @r, $r;
    }
    @r;
}

A debug-enabled version, that will pretty-print the mutation with colors when given the -d switch:

#! /usr/bin/perl
# usage mutagen [-d] number_of_replacements alignment_file [ sequence_file ..]
use strict;

my $debug = $ARGV[0] eq '-d' ? shift : 0;
my $max = shift() - 1;
my $algf = shift;
open my $alg, $algf or die "open $algf: $!";
my @alg = <$alg>;

sub prand { map int(rand() * $_[0]), 0..$max } 
while(<>){
    my @ip = prand length() - 1;
    my @op = prand scalar @alg;

    if($debug){
        my $t = ' ' x (length() - 1);
        substr $t, $ip[$_], 1, $ip[$_] for 0..$max;
        warn "@ip | @op\n    $_    $t\n";
        for my $i (0..$max){
            my $t = $alg[$op[$i]];
            $t =~ s/(.{$ip[$i]})(.)/$1\e[1;31m$2\e[m/;
            printf STDERR " %2d %s", $op[$i], $t;
        }
    }
    for my $i (0..$max){
        my $p = $ip[$i];
        substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;
    }
    print;
    if($debug){
        my @t = split "", $_;
        for my $i (0..$max){
            $_ = "\e[1;31m$_\e[m" for $t[$ip[$i]];
        }
        warn "  = ", @t, "\n";
    }
}
Related Question