Linux – How to convert a 3 column csv file into a table (or matrix)

command line, csv, file format, linux, text formatting

I have a CSV input file in a format like this, with a nucleotide sequence in field 1, text in field 2, and an integer in field 3:

ATGC,CD3,56
ATGC,CD4,67
ATGC,IgD,126
ATGC,IgM,127
AGTC,CD3,67
AGTC,CD4,78
AGTC,IgD,102
AGTC,IgM,89
TCGA,CD3,334
TCGA,CD4,123
TCGA,IgD,456
TCGA,IgM,80
CGTA,CD3,54
CGTA,CD4,32
CGTA,IgD,82
CGTA,IgM,117

When I open this CSV file in Numbers on a Mac, it is displayed as three columns. However, I want to convert it to a table (or matrix) format, also as a CSV file, with the values from the first column (the nucleotide sequences) as the column headers, so that the result looks like this:

     ATGC  AGTC  TCGA  CGTA
CD3  56    67    334   54
CD4  67    78    123   32
IgD  126   102   456   82
IgM  127   89    80    117

Below is a section from my real input CSV file (sample input.txt):

AGAATAGTCTGATTCT,-,,38
AGAATAGTCTGATTCT,AnnexinV,,51
AGAATAGTCTGATTCT,CD127,,39
AGAATAGTCTGATTCT,CD138,,3
AGAATAGTCTGATTCT,CD14,,2
AGAATAGTCTGATTCT,CD16,,4
AGAATAGTCTGATTCT,CD19,,10
AGAATAGTCTGATTCT,CD20,,6
AGAATAGTCTGATTCT,CD24,,21
AGAATAGTCTGATTCT,CD25,,4
AGAATAGTCTGATTCT,CD27,,87
AGAATAGTCTGATTCT,CD3,,235
AGAATAGTCTGATTCT,CD34,,5
AGAATAGTCTGATTCT,CD38,,18
AGAATAGTCTGATTCT,CD4,,412
AGAATAGTCTGATTCT,CD43,,99
AGAATAGTCTGATTCT,CD5,,430
AGAATAGTCTGATTCT,CD56,,3
AGAATAGTCTGATTCT,CD8,,7
AGAATAGTCTGATTCT,IgD,,4
AGAATAGTCTGATTCT,IgM,,2
TGTGGTAGTTCGTCTC,-,,9
TGTGGTAGTTCGTCTC,AnnexinV,,42
TGTGGTAGTTCGTCTC,CD127,,6
TGTGGTAGTTCGTCTC,CD138,,4
TGTGGTAGTTCGTCTC,CD16,,40
TGTGGTAGTTCGTCTC,CD19,,7
TGTGGTAGTTCGTCTC,CD20,,2
TGTGGTAGTTCGTCTC,CD24,,24
TGTGGTAGTTCGTCTC,CD25,,2

How can I do this using Linux text formatting commands?

Best Answer

Using awk:

{
    ks[$1, $2] = $3; # save the third column, indexed by (first, second); the comma keeps keys like "AB","C" and "A","BC" distinct
    k1[$1]++;        # record each distinct first-column value (table columns)
    k2[$2]++;        # record each distinct second-column value (table rows)
}
END {                                 # after processing all input
    for (j in k1) {                   # loop over the first-column values
        printf "\t%s", j;             # and print them as column headers
    }
    print "";                         # newline
    for (i in k2) {                   # loop over the second-column values
        printf "%s", i;               # print each as a row header
        for (j in k1) {               # loop over the first column again
            printf "\t%s", ks[j, i];  # and print the matching value
        }
        print "";                     # newline
    }
}

Output (script saved as foo.awk, sample data in foo):

~ awk -F, -f foo.awk foo
        AGTC    ATGC    CGTA    TCGA
CD4     78      67      32      123
IgD     102     126     82      456
IgM     89      127     117     80
CD3     67      56      54      334
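
The rows and columns above come out in awk's unspecified `for (x in array)` order, and the result is tab-separated rather than CSV. If you need the input order preserved and comma-separated output, a variant along these lines may help. It is only a sketch under a few assumptions: the file names `pivot.awk`, `input.csv`, and `table.csv` are placeholders I made up, the count is taken to be the last field on each line (so it covers both the 3-field sample and the 4-field real data with the empty third field), and any (row, column) pair missing from the input is left as an empty cell.

```shell
# pivot.awk, input.csv and table.csv are hypothetical names used for illustration.
cat > pivot.awk <<'EOF'
BEGIN { FS = OFS = "," }
{
    # Remember each sequence and each marker in first-seen order.
    if (!($1 in seen_col)) { seen_col[$1] = 1; cols[++nc] = $1 }
    if (!($2 in seen_row)) { seen_row[$2] = 1; rows[++nr] = $2 }
    val[$2, $1] = $NF    # assumption: the count is the last field on the line
}
END {
    # Header line: empty corner cell, then one column per sequence.
    line = ""
    for (c = 1; c <= nc; c++) line = line OFS cols[c]
    print line
    # One line per marker; missing combinations become empty cells.
    for (r = 1; r <= nr; r++) {
        line = rows[r]
        for (c = 1; c <= nc; c++)
            line = line OFS (((rows[r], cols[c]) in val) ? val[rows[r], cols[c]] : "")
        print line
    }
}
EOF

# The sample data from the question.
cat > input.csv <<'EOF'
ATGC,CD3,56
ATGC,CD4,67
ATGC,IgD,126
ATGC,IgM,127
AGTC,CD3,67
AGTC,CD4,78
AGTC,IgD,102
AGTC,IgM,89
TCGA,CD3,334
TCGA,CD4,123
TCGA,IgD,456
TCGA,IgM,80
CGTA,CD3,54
CGTA,CD4,32
CGTA,IgD,82
CGTA,IgM,117
EOF

awk -f pivot.awk input.csv > table.csv
cat table.csv
```

On the sample data this should give a header line of `,ATGC,AGTC,TCGA,CGTA` followed by the CD3, CD4, IgD and IgM rows in that order. In the real data the second sequence has no CD14 line, so that cell would simply be empty.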