Lum – AWK – a question about columns


I tried to solve this myself, but I am too new to awk to make it work.

Let's assume that we have a file (e.g. database.txt) whose values are tab-separated:

NA64715 YU24921 MI84612 MI98142 NA94732    
3241531 4957192 4912030 6574918 0473625     
0294637 9301032 8561730 8175919 8175920     
9481732 9359032 8571930 8134983 9385130     
9345091 9385112 2845830 4901742 3455141     

In a separate file (e.g. populations.txt) I have information about which ID belongs to which group, e.g.:

NA64715 Europe    
YU24921 Europe    
MI84612 Asia    
MI98142 Africa    
NA94732 Asia    

What I need is for awk to create a separate file for each group (Europe, Asia, Africa), containing only that group's columns. The real file is huge, so I cannot simply count the columns and select them by number by hand. awk has to check which population each ID belongs to, find the matching column in the database file, and copy the whole column to that population's output file.

The result should look like:

File 1 (europe.txt):

NA64715 YU24921     
3241531 4957192     
0294637 9301032     
9481732 9359032    
9345091 9385112      

File 2 (asia.txt):

MI84612 NA94732    
4912030 0473625    
8561730 8175920    
8571930 9385130    
2845830 3455141    

File 3 (africa.txt):

MI98142     
6574918    
8175919    
8134983    
4901742    

Can anyone help me with this issue?

Best Answer

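This makes a single pass over database.txt and never holds the whole file in memory. It does keep one file descriptor open per destination file for the duration of the run. Note that each output file is named after the population exactly as it is written in populations.txt, so you get Europe.txt rather than europe.txt.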

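awk -F '\t' '
    # First file (populations.txt): map each ID to its population.
    NR==FNR {population[$1]=$2; next}
    # Header row of database.txt: pick the output file for each column.
    FNR==1 {
        for (i=1; i<=NF; i++) {
            destination[i] = population[$i] ".txt"
        }
    }
    # Every row, header included: append each field to the file for its column.
    {
        delete separator    # reset the per-file separators for this row
        for (i=1; i<=NF; i++) {
            printf "%s%s", separator[destination[i]], $i > destination[i]
            separator[destination[i]] = FS
        }
        # End the row in every file that received a field.
        for (file in separator) {
            printf "\n" > file
        }
    }
' populations.txt database.txt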
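
To try this on the sample data from the question, something like the following sketch may help. It is a variant of the command above, not a replacement for it, and it makes a few assumptions of its own: the temporary working directory (`workdir`) is only there to keep the demo files out of the way, the output names are lower-cased with `tolower` so they match europe.txt, asia.txt and africa.txt from the question, columns whose ID is not listed in populations.txt are skipped, and `split("", separator)` is used to empty the array in case your awk does not accept `delete` on a whole array.

```shell
#!/bin/sh
# Work in a scratch directory (assumption: any writable directory would do).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the sample inputs from the question, tab-separated.
printf 'NA64715\tYU24921\tMI84612\tMI98142\tNA94732\n'  > database.txt
printf '3241531\t4957192\t4912030\t6574918\t0473625\n' >> database.txt
printf '0294637\t9301032\t8561730\t8175919\t8175920\n' >> database.txt
printf '9481732\t9359032\t8571930\t8134983\t9385130\n' >> database.txt
printf '9345091\t9385112\t2845830\t4901742\t3455141\n' >> database.txt

printf 'NA64715\tEurope\nYU24921\tEurope\nMI84612\tAsia\n'  > populations.txt
printf 'MI98142\tAfrica\nNA94732\tAsia\n'                  >> populations.txt

awk -F '\t' '
    # First file: map each ID to a lower-cased output file name.
    NR == FNR { population[$1] = tolower($2) ".txt"; next }
    # Header row of the database: remember where each known column goes.
    FNR == 1 {
        for (i = 1; i <= NF; i++)
            if ($i in population) destination[i] = population[$i]
    }
    {
        split("", separator)                   # empty the array portably
        for (i = 1; i <= NF; i++) {
            if (!(i in destination)) continue  # ID not listed: skip column
            file = destination[i]
            printf "%s%s", separator[file], $i > file
            separator[file] = FS
        }
        # Finish the current row in every file that received a field.
        for (file in separator) printf "\n" > file
    }
' populations.txt database.txt
```

After it runs, the scratch directory should contain europe.txt, asia.txt and africa.txt with the columns shown in the question. Columns stay in the order they have in database.txt, and each output row is still tab-separated.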