Shell – Replace each unique value in all columns with a unique identifier

awk, linux, perl, shell, text processing

I have a file with 250k rows and 10 columns like:

img1 aa bb cc ...
img2 aa yy dd ...
img3 uu bb ee ...
img4 NA bb tt ...

I want a script that will convert this file to:

img1 1 1 1 ...
img2 1 2 2 ...
img3 2 1 3 ...
img4 0 1 4 ...

In every column after the first, each unique value should be replaced with a unique identifier. Identifiers are assigned per column and start at 1; 0 is reserved for the string "NA".

Also, for each column, I want to generate a file that contains the mapping. For example, the file for the 2nd column should be:

NA 0
aa 1
uu 2

Can anyone suggest an elegant solution for this? Any help would be greatly appreciated.

Best Answer

Here's a very simple approach. Works fine for me, using gawk 3.1.7.

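#!/usr/bin/awk -f
# a[x, value] holds the identifier assigned to value in column x;
# m[x] counts the distinct non-NA values seen so far in column x.
{
    for (x = 2; x <= NF; x++) {
        if ((x, $x) in a) {
            $x = a[x, $x]
        } else if ($x == "NA") {
            # 0 is reserved for NA
            print $x, 0 > ("column" x)
            a[x, $x] = 0
            $x = 0
        } else {
            # first occurrence: assign the next identifier for this column
            m[x]++
            print $x, m[x] > ("column" x)
            a[x, $x] = m[x]
            $x = m[x]
        }
    }
    print $0 > "results"
}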
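
The script above writes each mapping line the first time a value appears, so "NA 0" lands wherever NA is first seen in a column. If the mapping files should list NA first, as in the question's example, one option is to keep the mappings in memory and write them out in an END block. This is only a minimal sketch, assuming a POSIX awk; the function name encode_columns is hypothetical, and the output names results and columnN are carried over from the answer above:

```shell
#!/bin/sh
# encode_columns FILE (hypothetical helper): writes "results" and
# "column2".."columnN" into the current directory.
encode_columns() {
    awk '
    {
        for (x = 2; x <= NF; x++) {
            if ($x == "NA") {
                $x = 0                      # 0 is reserved for NA
            } else {
                if (!((x, $x) in id)) {     # first time this value appears in column x
                    id[x, $x] = ++n[x]
                    val[x, n[x]] = $x       # remember the value for the mapping file
                }
                $x = id[x, $x]
            }
        }
        if (NF > maxnf) maxnf = NF
        print > "results"
    }
    END {
        for (x = 2; x <= maxnf; x++) {
            file = "column" x
            print "NA", 0 > file            # NA is listed first, even if it never occurs
            for (k = 1; k <= n[x]; k++)
                print val[x, k], k > file
            close(file)
        }
    }' "$1"
}
```

Called as encode_columns data.txt, this should reproduce the table from the question. It trades some memory (one entry per distinct value per column) for mapping files that always start with NA.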

Awk

This awk script pairs the first column with each later column in turn and prints the pairs one column after another. It works with any number of columns (at least two), provided every line has the same number of fields, and it preserves the line order within each column. It makes no assumptions about what the columns contain (numeric or not, sorted or not, etc.):

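{
    # store "column 1, column i" for every later column, keyed by position
    for (i = 2; i <= NF; i++) {
        a[j + i] = $1 " " $i
    }
    j += NF    # offset of the next line's entries
    nf = NF    # remember the field count for the END block
}
END {
    OutNR = NR * nf
    # walk column by column, then line by line
    for (i = 2; i <= nf; i++) {
        for (j = 0; j < OutNR; j += nf) {
            print a[j + i]
        }
    }
}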

Given:

0 0 0 0.2340
0.05 9.6877884e-06 0.0024898597 0.2341
0.1 4.2838688e-05 0.0049595502 0.2342
0.15 0.00016929444 0.0074092494 0.2343
0.2 0.00036426881 0.009839138 0.2344
0.25 0.00055234582 0.012249394 0.2345
0.3 0.00077448576 0.014640196 0.2346
0.35 0.00082546537 0.017011717 0.2347
0.4 0.0012371619 0.019364133 0.2348
0.45 0.0013286382 0.02169761 0.2349

The output is ordered by column (2..n), then by line:

0 0
0.05 9.6877884e-06
0.1 4.2838688e-05
0.15 0.00016929444
0.2 0.00036426881
0.25 0.00055234582
0.3 0.00077448576
0.35 0.00082546537
0.4 0.0012371619
0.45 0.0013286382
0 0
0.05 0.0024898597
0.1 0.0049595502
0.15 0.0074092494
0.2 0.009839138
0.25 0.012249394
0.3 0.014640196
0.35 0.017011717
0.4 0.019364133
0.45 0.02169761
0 0.2340
0.05 0.2341
0.1 0.2342
0.15 0.2343
0.2 0.2344
0.25 0.2345
0.3 0.2346
0.35 0.2347
0.4 0.2348
0.45 0.2349
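
The awk script above keeps every line in memory until the END block. For inputs too large for that, an alternative is to read the file once per column instead. This is a minimal sketch, assuming every line has the same number of fields as the first one; the function name append_columns is hypothetical:

```shell
#!/bin/sh
# append_columns FILE (hypothetical helper): prints "column 1, column i"
# for i = 2..n, one column after another.
append_columns() {
    # take the field count from the first line
    nf=$(awk 'NR == 1 { print NF; exit }' "$1")
    nf=${nf:-0}                      # empty input: nothing to print
    i=2
    while [ "$i" -le "$nf" ]; do
        # one pass over the file per column; nothing is held in memory
        awk -v c="$i" '{ print $1, $c }' "$1"
        i=$((i + 1))
    done
}
```

Called as append_columns data.txt, this reads the file n-1 times, so it trades extra I/O for constant memory use.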

R

Although most people don't think of R for text processing, it is actually a bit more straightforward in this case; the option settings just make it look more complex than it really is. The essence of this solution is to rbind() the results of multiple cbind() calls:

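# read every column as text so the values are written back unchanged
d.in <- read.table(file = commandArgs(trailingOnly = TRUE)[1],
                   colClasses = "character")
d.out <- data.frame()
# stack the (column 1, column i) pairs for i = 2..n
for (i in 2:ncol(d.in)) {
    d.out <- rbind(d.out, cbind(d.in[, 1], d.in[, i]))
}
write.table(d.out, row.names = FALSE, col.names = FALSE, quote = FALSE)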

Then, just:

$ Rscript script.R data.txt
0 0
0.05 9.6877884e-06
0.1 4.2838688e-05
0.15 0.00016929444
0.2 0.00036426881
0.25 0.00055234582
0.3 0.00077448576
0.35 0.00082546537
0.4 0.0012371619
0.45 0.0013286382
0 0
0.05 0.0024898597
0.1 0.0049595502
0.15 0.0074092494
0.2 0.009839138
0.25 0.012249394
0.3 0.014640196
0.35 0.017011717
0.4 0.019364133
0.45 0.02169761
0 0.2340
0.05 0.2341
0.1 0.2342
0.15 0.2343
0.2 0.2344
0.25 0.2345
0.3 0.2346
0.35 0.2347
0.4 0.2348
0.45 0.2349

Related Solutions

Lum – Append columns in a text file to after the final row
