Bash – Random mutagenesis with bash

bashscripting

I have a string e.g.

1234567890

and I want to replace random positions of that string with corresponding position from a random sequence in another set of other strings e.g.

ABCDEFGHIJ
KLMNOPQRST
UVWXYZABCD
...

If I chose to make 3 replacements, the script should chose 3 random numbers e.g. 3,7,8; and 3 random sequences e.g. 1, 1, 3; make the replacements to generate the expected output:

12C456GB90

Is there a way to do this without significant looping? I wrote a simple bash script to generate a random position and a random sequence line then do 1 replacement, then repeat the process on the output, repeat, repeat. This works perfectly, however in my real-life files (much larger than the examples), I want to generate 10,000 or more replacements. Oh, and I will need to do this multiple times to generate multiple 'mutated' variant sequences.

EDIT: At the moment I am using something like this:

#chose random number between 1 and the number of characters in the string
randomposition=$(jot -r 1 1 $seqpositions)
#chose a random number between 1 and the number of lines in the set of potential replacement strings
randomline=$(jot -r 1 1 $alignlines)
#find the character at randomline:randomposition
newAA=$(sed -n "$randomline,$randomline p" $alignmentfile | cut -c$randomposition)
#replace the character at 'string:randomposition' with the character at 'randomline:randomposition'
sed "s/./$newAA/$randomposition" $sequencefile

(with some additional bits, obviously) and just looping through this thousands of times

Best Answer

Note:

This is strictly for amusement purposes; an equivalent program in C would be much simpler and orders of magnitude faster; as to bash, let's not even talk about ;-)

The following perl script will mutate a list of ~1M sequences, and ~10k alignments in about 10 seconds on my laptop.

#! /usr/bin/perl
# usage mutagen number_of_replacements alignment_file [ sequence_file ..]
use strict;
my $max = shift() - 1;
my $algf = shift;
open my $alg, $algf or die "open $algf: $!";
my @alg = <$alg>;

sub prand { map int(rand() * $_[0]), 0..$max }
while(<>){
    my @ip = prand length() - 1;
    my @op = prand scalar @alg;
    for my $i (0..$max){
        my $p = $ip[$i];
        substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;
    }
    print;
}

Usage example:

$ cat seq
1634870295
5684937021
2049163587
6598471230
$ cat alg
DPMBHZJEIO
INTMJZOYKQ
KNTXGLCJSR
GLJZRFVSEX
SYJVHEPNAZ
$ perl mutagen 3 alg seq
1L3V8702I5
5684HE7Y21
2049JZC587
6598H7C2E0

If the generated n random numbers have to be different between them, then prand should be changed to:

sub prand {
    my (@r, $m, %h);
    die "more replacements than positions/alignments" if $max >= $_[0];
    for(0..$max){
        my $r = int(rand() * $_[0]);
        $r = ($r + 1) % $_[0] while $h{$r};
        $h{$r} = 1;
        push @r, $r;
    }
    @r;
}

A debug-enabled version, that will pretty-print the mutation with colors when given the -d switch:

#! /usr/bin/perl
# usage mutagen [-d] number_of_replacements alignment_file [ sequence_file ..]
use strict;

my $debug = $ARGV[0] eq '-d' ? shift : 0;
my $max = shift() - 1;
my $algf = shift;
open my $alg, $algf or die "open $algf: $!";
my @alg = <$alg>;

sub prand { map int(rand() * $_[0]), 0..$max } 
while(<>){
    my @ip = prand length() - 1;
    my @op = prand scalar @alg;

    if($debug){
        my $t = ' ' x (length() - 1);
        substr $t, $ip[$_], 1, $ip[$_] for 0..$max;
        warn "@ip | @op\n    $_    $t\n";
        for my $i (0..$max){
            my $t = $alg[$op[$i]];
            $t =~ s/(.{$ip[$i]})(.)/$1\e[1;31m$2\e[m/;
            printf STDERR " %2d %s", $op[$i], $t;
        }
    }
    for my $i (0..$max){
        my $p = $ip[$i];
        substr $_, $p, 1, substr $alg[$op[$i]], $p, 1;
    }
    print;
    if($debug){
        my @t = split "", $_;
        for my $i (0..$max){
            $_ = "\e[1;31m$_\e[m" for $t[$ip[$i]];
        }
        warn "  = ", @t, "\n";
    }
}

Related Solutions

Bash – Replace multiple strings in a single pass

OK, a general solution. The following bash function requires 2k arguments; each pair consists of a placeholder and a replacement. It's up to you to quote the strings appropriately to pass them into the function. If the number of arguments is odd, an implicit empty argument will be added, which will effectively delete occurrences of the last placeholder.

Neither placeholders nor replacements may contain NUL characters, but you may use standard C \-escapes such as \0 if you need NULs (and consequently you are required to write \\ if you want a \).

It requires the standard build tools which should be present on a posix-like system (lex and cc).

replaceholder() {
  local dir=$(mktemp -d)
  ( cd "$dir"
    { printf %s\\n "%option 8bit noyywrap nounput" "%%"
      printf '"%s" {fputs("%s", yyout);}\n' "${@//\"/\\\"}"
      printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
    } | lex && cc lex.yy.c
  ) && "$dir"/a.out
  rm -fR "$dir"
}

We assume that \ is already escaped if necessary in the arguments but we need to escape double quotes, if present. That's what the second argument to the second printf does. Since the lex default action is ECHO, we don't need to worry about it.

Example run (with timings for the skeptical; it's just a cheap-o commodity laptop):

$ time echo AB | replaceholder A B B A
BA

real    0m0.128s
user    0m0.106s
sys     0m0.042s
$ time printf %s\\n AB{0000..9999} | replaceholder A B B A > /dev/null

real    0m0.118s
user    0m0.117s
sys     0m0.043s

For larger inputs it might be useful to provide an optimization flag to cc, and for current Posix compatibility, it would be better to use c99. An even more ambitious implementation might try to cache the generated executables instead of generating them each time, but they're not exactly expensive to generate.

Edit

If you have tcc, you can avoid the hassle of creating a temporary directory, and enjoy the faster compile time which will help on normal sized inputs:

treplaceholder () { 
  tcc -run <(
  {
    printf %s\\n "%option 8bit noyywrap nounput" "%%"
    printf '"%s" {fputs("%s", yyout);}\n' "${@//\"/\\\"}"
    printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
  } | lex -t)
}

$ time printf %s\\n AB{0000..9999} | treplaceholder A B B A > /dev/null

real    0m0.039s
user    0m0.041s
sys     0m0.031s

Bash – How to Efficiently Generate Large, Uniformly Distributed, Random Integers in Bash

I see another interesting method from here.

rand=$(openssl rand 4 | od -DAn)

This one also seems to be a good option. It reads 4 bytes from the random device and formats them as unsigned integer between 0 and 2^32-1.

rand=$(od -N 4 -t uL -An /dev/urandom | tr -d " ")

Best Answer

Related Solutions

Bash – Replace multiple strings in a single pass

Bash – How to Efficiently Generate Large, Uniformly Distributed, Random Integers in Bash

Related Question