How to remove duplicate values from each line of a tab-delimited text file

csv-simple, text-processing

I have tab-delimited text with several columns, like this:

A      B1      B1     C1
B      B2      D2 
C      C12     C13    C13
D      D3      D5      D9
G      F2      F2   

How can I convert the table above into the one below, with duplicate values removed from each line?

A      B1     C1
B      B2     D2 
C      C12    C13
D      D3     D5     D9
G      F2   

I have extracted a sample from my real data file, which is tab-delimited. I tried the command line you (Stéphane Chazelas?) posted. It works fine, but it could not remove the duplicates in the last column:

A  CD274    PDCD1LG2  CD276   PDCD1LG2  CD274
B  NEK2     NEK6      NEK10   NEK10     NEKL-4
C  TNFAIP3  OTUD7B    OTUD7B  TNFAIP3   TNFAIP3
D  DUSP16   DUSP4     DUSP8   VHP-1     DUSP8
E  AGO2     AGO2      AGO2    AGO2      AGO2

The output needs to be as below:

A  CD274    CD276   PDCD1LG2
B  NEK2     NEK6    NEK10     NEKL-4
C  TNFAIP3  OTUD7B
D  DUSP16   DUSP4   DUSP8     VHP-1
E  AGO2

Best Answer

First set of example data:

$ awk -vOFS='\t' '{ r=""; delete t; for (i=1;i<=NF;++i) { if (!t[$i]++) { r = r ? r OFS $i : $i } } print r }' file
A       B1      C1
B       B2      D2
C       C12     C13
D       D3      D5      D9
G       F2

Second set of example data (same awk script):

$ awk -vOFS='\t' '{ r=""; delete t; for (i=1;i<=NF;++i) { if (!t[$i]++) { r = r ? r OFS $i : $i } } print r }' file
A       CD274   PDCD1LG2        CD276
B       NEK2    NEK6    NEK10   NEKL-4
C       TNFAIP3 OTUD7B
D       DUSP16  DUSP4   DUSP8   VHP-1
E       AGO2

The script reads the input file, file, line by line. For each line it goes through the fields in order, building up the output line, r. If a field's value has already been added to the output line (determined by a lookup table, t, of the values seen so far on that line), the field is skipped; otherwise it is appended.

When all the fields of an input line have been processed, the constructed line is printed.
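
To make the lookup-table test easier to follow, here is a small sketch that prints the decision made for each field instead of building an output line. The function name trace_fields and the sample input are made up for this illustration and are not part of the answer's script.

```shell
# trace_fields is a hypothetical helper used only for this illustration.
trace_fields() {
    awk '{
        split("", t)    # empty the lookup table for each new line
        for (i = 1; i <= NF; ++i) {
            # t[$i]++ yields the count before incrementing: 0 on first sight
            if (!t[$i]++)
                print "keep", $i
            else
                print "skip", $i
        }
    }'
}

# Sample line taken from row C of the second data set.
printf 'C\tTNFAIP3\tOTUD7B\tOTUD7B\tTNFAIP3\tTNFAIP3\n' | trace_fields
```

Only the first occurrence of each value is reported as kept, which is why the order of first appearance is preserved in the real script's output.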

The output field delimiter is set to tab through -vOFS='\t' on the command line.
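
Note that only the output delimiter is set here. The input is still split on awk's default field separator, which is any run of spaces and tabs. If the real data might contain values with embedded spaces, or empty fields from trailing tabs, a variant along the following lines may be worth trying. It is a sketch under those assumptions, not the script from this answer: the function name dedup_fields is hypothetical, -F'\t' and the empty-field check are additions, and split("", t) is used in place of delete t as a widely portable way to empty an array.

```shell
# dedup_fields is a hypothetical helper; it reads tab-delimited text on stdin.
dedup_fields() {
    awk -F'\t' -v OFS='\t' '{
        r = ""; n = 0
        split("", t)    # empty the lookup table for each new line
        for (i = 1; i <= NF; ++i) {
            # Assumption: empty fields (e.g. from a trailing tab) are dropped.
            if ($i != "" && !t[$i]++)
                r = (n++ ? r OFS : "") $i
        }
        print r
    }'
}

# Made-up input: a value containing a space, and a line with a trailing tab.
printf 'A\tx y\tx y\tz\nB\tB2\tD2\t\n' | dedup_fields
```

Counting the kept fields in n, rather than testing r itself, also avoids a corner case where a first field of 0 would make r look empty.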


The awk script unravelled:

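{
    r = ""        # output line built so far
    delete t      # forget the values seen on the previous line

    for (i = 1; i <= NF; ++i) {
        # t[$i]++ is 0 (false) only the first time a value is seen
        if (!t[$i]++) {
            # append with a leading OFS unless r is still empty
            r = r ? r OFS $i : $i
        }
    }

    print r
}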