Output part of each line to a separate file

awk, sed, text-processing

I have a file like this:

a   AGTACTTCCAGGAACGGTGCACTCTCC
b   ATGGATTTTTGGAGCAGGGAGATGGAATAGGAGCATGCTCCAT
c   ATATTAAATGGATTTTTGGAGCAGGGAGATGGAATAGGAGCATGCTCCATCCACTCCACAC
d   ATCAGTTTAATATCTGATACGTCCTCTATCCGAGGACAATATATTAAATGGA
e   TTTGGCTAAGATCAAGTGTAGTATCTGTTCTTATAAGTTTAATATCTGATATGTCCTCTATCTGA

I want to make a file a.seq which contains the sequence AGTACTTCCAGGAACGGTGCACTCTCC. Similarly, b.seq should contain ATGGATTTTTGGAGCAGGGAGATGGAATAGGAGCATGCTCCAT. In short, column 1 should be used as the output file name, with extension .seq, and the file should contain the corresponding column-2 sequence. I can do this by writing a Perl script, but anything on the command line would be helpful. Hope to hear soon.

Best Answer

My snap response would have been awk, but if you're processing lots of lines (and I'm talking about millions) you'll likely see a real benefit from switching to a "real" programming language.

With that in mind (and awk already being taken as an answer), I wrote a few implementations in different languages and benchmarked them on the same 100,000-line dataset on a PCI-E SSD.

me* (C)                0m1.734s
me (C++)               0m1.991s
me (Python/Pypy)       0m2.390s
me (perl)              0m3.024s
Thor+Glenn (sed|sh)    0m3.353s
me (python)            0m3.359s
jasonwryan+Thor (awk)  0m3.779s
rush (while read)      0m6.011s
Thor (sed)             1m30.947s
me (parallel)          4m9.429s
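
For reference, the "while read" entry in that table is the classic shell loop. A minimal reconstruction of its general shape (my sketch, not rush's exact code):

while read -r id seq; do
    printf '%s\n' "$seq" > "$id.seq"
done < infile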

At a glance the C looks best but it was a pig to get to run that fast. Pypy and C++ are much easier to write and perform well enough unless you're talking about many billions of lines. If that were the case, an upgrade to doing this all in RAM or on an SSD might be a better investment than a code improvement.

Obviously in the time I've spent going through these you could probably have processed a few hundred million records in the slowest option. If you can only write awk or Bash loops, do that and get on with life. I clearly had too much spare time today.

I did also test some multi-threaded options (in C++ and Python and hybrids with GNU parallel) but the overhead of threads completely outweighs any benefit for such a simple operation (string splitting, writing).

Perl

awk (gawk here) would honestly be my first port of call for data like this, but you can do fairly similar things in Perl. The syntax is similar, and Perl gives you a slightly nicer handle for writing files; I've sketched the awk version below for comparison.

perl -ane 'open(my $fh, ">", $F[0].".seq"); print $fh $F[1]; close $fh;' infile
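
The awk approach is a one-liner of roughly this shape (my sketch; the jasonwryan+Thor answer from the benchmark may differ in detail):

awk '{ fn = $1 ".seq"; print $2 > fn; close(fn) }' infile

The close() is there so awk doesn't hold every output file open at once if your real data has many distinct IDs.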

Python

I like Python. It's my day job language and it's just a nice, solid and incredibly readable language. Even a beginner could probably guess what's happening here.

with open("infile", "r") as f:
    for line in f:
        id, chunk = line.split()
        with open(id + ".seq", "w") as fw:
            fw.write(chunk)

You have to remember that your distribution's python binary isn't the only implementation of Python out there. When I ran this same test through Pypy, it was faster than C without any further logic optimisation. Keep that in mind before writing Python off as a "slow language".
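
Invocation is identical apart from the interpreter, assuming the script above is saved as split.py (my name for it) and a pypy binary is on your PATH:

python split.py    # stock CPython
pypy split.py      # same script, JIT-compiled interpreter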

C

I started this example to see how hard we could really push the CPU, but frankly, C is a nightmare to write if you haven't touched it in a long time. It has the added downside of being limited to 100-character lines, though it's very simple to expand that; I just didn't need to.

My original version was slower than C++ and pypy but after blogging about it I got some help from Julian Klode. This version is now the fastest because of its tweaked IO buffers. It's also a lot longer and more involved than anything else.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define BUFLEN (8 * 1024)

int main(void) {
    FILE *fp;
    FILE *fpout;

    char line[100];
    char *id;
    char *token;
    char *buf = malloc(BUFLEN);

    fp = fopen("infile", "r");

    /* a big line-buffered input buffer is where the speed comes from */
    setvbuf(fp, buf, _IOLBF, BUFLEN);
    while (fgets(line, 100, fp) != NULL) {
        id = strtok(line, "\t");
        /* "\t\n" as delimiters also strips the trailing newline,
           matching what the other implementations write out */
        token = strtok(NULL, "\t\n");

        /* room for id + ".seq" + terminating NUL */
        char *fnout = malloc(strlen(id) + 5);
        strcpy(fnout, id);      /* strcpy first: the buffer starts uninitialised */
        strcat(fnout, ".seq");

        fpout = fopen(fnout, "w");
        setvbuf(fpout, NULL, _IONBF, 0);  /* unbuffered: one short write per file */
        fprintf(fpout, "%s", token);
        fclose(fpout);
        free(fnout);
    }
    fclose(fp);
    free(buf);

    return 0;
}

C++

Performs well and is much easier to write than real C. You have all sorts of things that hold your hand (especially when it comes to strings and input), which means you can actually simplify the logic down. strtok in C is a hog: it chews through the string in place, and then we have to do all that tiresome memory allocation for the filename. This version just flits along the line until it hits the tab and pulls the segments out as we need them.

#include <fstream>
#include <string>
using namespace std;

int main(void) {
    ifstream in("infile");
    ofstream out;
    string line;

    while (getline(in, line)) {
        // find the first tab; skip malformed lines without one
        string::size_type tab = line.find('\t', 0);
        if (tab == string::npos)
            continue;

        string filename = line.substr(0, tab) + ".seq";
        out.open(filename.c_str());
        out << line.substr(tab + 1);  // everything after the tab
        out.close();
    }

    in.close();
}

GNU Parallel

(Not the moreutils version.) It's a nice, concise syntax but OMGSLOW. I might be using it wrong, though spawning a separate job for every single line means the process start-up overhead probably dwarfs the actual work here.

parallel --colsep '\t' echo {2} \> {1}.seq <infile

Test harness generator

Here's my data generator for 100,000 lines of [ATGC]*64. It's not fast and improvements are very welcome.

tr -dc 'ATGC' < /dev/urandom | fold -w 64 | awk 'NR>100000{exit} {printf "%d\t%s\n", NR, $0}' > infile