Shell – Replace each unique value in all columns with a unique identifier

awklinuxperlshelltext processing

I have a file with 250k rows and 10 columns like:

img1 aa bb cc ...
img2 aa yy dd ...
img3 uu bb ee ...
img4 NA bb tt ...

I want a script that will convert this file to:

img1 1 1 1 ...
img2 1 2 2 ...
img3 2 1 3 ...
img4 0 1 4 ...

A unique value from each column after first one should be replaced with a unique identifier starting from 0, where 0 is reserved for the string "NA".

Also, for each column, I want to generate a file that contains the mapping. For example, the file for 2nd column should be:

NA 0
aa 1
uu 2

Can anyone suggest an elegant solution for this? Any help would be greatly appreciated.

Best Answer

Here's a very simple approach. Works fine for me, using gawk 3.1.7.

#!/usr/bin/awk -f
{
    for(x=2;x<=NF;x++) {
        if(x$x in a) {
            $x=a[x$x]
        } else {
            if($x=="NA") {
                print $x,0 > "column"x
                a[x$x]=0
                $x="0"
            } else {
                m[x]++
                print $x,m[x] > "column"x
                a[x$x]=m[x]
                $x=m[x]
            }
        }
    }
    print $0 > "results"
}
Related Question