How to remove duplicate values in a tab-delimited text file

csv-simple text-processing

I have tab-delimited text with several columns, like this:

A      B1      B1     C1
B      B2      D2 
C      C12     C13    C13
D      D3      D5      D9
G      F2      F2

How can I convert the table above into the one below, with duplicate values removed from each row?

A      B1     C1
B      B2     D2 
C      C12    C13
D      D3     D5     D9
G      F2

Here is an extract from my real data file, which is tab-delimited. I tried the command line you (Stéphane Chazelas?) posted. It works fine, except that it does not remove the duplicate in the last column:

A  CD274    PDCD1LG2  CD276   PDCD1LG2  CD274
B  NEK2     NEK6      NEK10   NEK10     NEKL-4
C  TNFAIP3  OTUD7B    OTUD7B  TNFAIP3   TNFAIP3
D  DUSP16   DUSP4     DUSP8   VHP-1     DUSP8
E  AGO2     AGO2      AGO2    AGO2      AGO2

The output needs to be as below:

A  CD274    CD276   PDCD1LG2
B  NEK2     NEK6    NEK10     NEKL-4
C  TNFAIP3  OTUD7B
D  DUSP16   DUSP4   DUSP8     VHP-1
E  AGO2

Best Answer

First set of example data:

$ awk -vOFS='\t' '{ r=""; delete t; for (i=1;i<=NF;++i) { if (!t[$i]++) { r = r ? r OFS $i : $i } } print r }' file
A       B1      C1
B       B2      D2
C       C12     C13
D       D3      D5      D9
G       F2

Second set of example data (same awk script):

$ awk -vOFS='\t' '{ r=""; delete t; for (i=1;i<=NF;++i) { if (!t[$i]++) { r = r ? r OFS $i : $i } } print r }' file
A       CD274   PDCD1LG2        CD276
B       NEK2    NEK6    NEK10   NEKL-4
C       TNFAIP3 OTUD7B
D       DUSP16  DUSP4   DUSP8   VHP-1
E       AGO2

The script reads the input file, file, line by line. For each line it goes through each field, building up the output line, r. If the value in a field has already been added to the output line (determined by a lookup table, t, of used field values), the field is ignored; otherwise it is added. The table is emptied with delete t at the start of every line, so duplicates are only detected within a single line.

When all the fields of an input line have been processed, the constructed line is printed.

The output field delimiter is set to tab through -vOFS='\t' on the command line.

The awk script unravelled:

{
    r = ""
    delete t

    for (i = 1; i <= NF; ++i) {
        if (!t[$i]++) {
            r = r ? r OFS $i : $i
        }
    }

    print r
}
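
As a hedged variant of the script above, the sketch below wraps the same logic in a shell function (dedupe_fields is a hypothetical name, not part of the original answer). It assumes the input is strictly tab-delimited and therefore sets -F'\t', so that values containing spaces stay in one field. It also assumes that empty fields (for example, from a trailing tab) should be dropped, and it compares r with the empty string because a first field such as 0 would count as false in the r ? ... : ... test.

```shell
# dedupe_fields: hypothetical helper; reads tab-delimited text on stdin and
# prints each line with repeated field values removed (first occurrence kept).
dedupe_fields() {
    awk -F'\t' -v OFS='\t' '
    {
        r = ""
        delete seen                     # forget the values of the previous line
        for (i = 1; i <= NF; ++i) {
            if ($i == "") continue      # assumption: ignore empty fields
            if (!seen[$i]++) {
                # explicit empty-string test, so a first field of "0" survives
                r = (r == "") ? $i : r OFS $i
            }
        }
        print r
    }'
}
```

It would be called as dedupe_fields < file. Like the original script, it keeps the first occurrence of each value and preserves the input order.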

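Related Solutions

Shell – How to extract the first integer from text string in a column of a tab-delimited file

Using sed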

To find the first integer from the fifth column:

$ sed -r 's/([^\t]*\t){4}[^[:digit:]]*([[:digit:]]+).*/\2/' file
2458
45
78

The above was tested with GNU sed. On macOS or another BSD system, try:

sed -E 's/([^\t]*\t){4}[^[:digit:]]*([[:digit:]][[:digit:]]*).*/\2/' file

Using awk

$ awk '{sub(/^[^[:digit:]]*/, "", $5); sub(/[^[:digit:]].*/, "", $5); print $5;}' file
2458
45
78
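
A further sketch, assuming the file is strictly tab-delimited: the awk function match() can locate the first run of digits directly, which avoids the two sub() calls. The function name first_int_col5 is hypothetical. Unlike the command above, this version prints nothing for a line whose fifth column contains no digits.

```shell
# first_int_col5: hypothetical helper; prints the first integer found in
# the fifth tab-separated column of each input line.
first_int_col5() {
    awk -F'\t' '{
        # match() sets RSTART and RLENGTH to the position and length of the match
        if (match($5, /[0-9]+/))
            print substr($5, RSTART, RLENGTH)
    }'
}
```

It would be called as first_int_col5 < file.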

Shell – Remove lines from tab-delimited file with missing values

If your fields can never contain whitespace, an empty field means either a tab as a first character (^\t), a tab as the last character (\t$) or two consecutive tabs (\t\t). You could therefore filter out lines containing any of those:

grep -Ev $'^\t|\t\t|\t$' file

If you can have whitespace, things get more complex. If your fields can begin with spaces, use this instead (it considers a field with only spaces to be empty):

grep -Pv '\t\s*(\t|$)|\t$|^\t' file

The change filters out lines matching a tab followed by zero or more whitespace characters (\s*) and then either another tab or the end of the line.

That will also fail if the last field contains nothing but spaces. To avoid that too, use perl with the -F and -a options to split input into the @F array, telling it to print unless one of the fields is empty (/^$/):

perl -F'\t' -lane 'print unless grep{/^$/} @F' file
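
The same filter can also be sketched in awk, under the assumptions that the separator is a single tab and that a field holding only spaces counts as missing. The function name drop_incomplete_rows is hypothetical, and dropping completely blank lines is a choice made here, not something the original answer specifies.

```shell
# drop_incomplete_rows: hypothetical helper; prints only the lines in which
# every tab-separated field contains at least one non-space character.
drop_incomplete_rows() {
    awk -F'\t' '
    NF == 0 { next }                    # assumption: drop blank lines too
    {
        for (i = 1; i <= NF; ++i)
            if ($i ~ /^ *$/) next       # empty or spaces-only field: skip line
        print
    }'
}
```

It would be called as drop_incomplete_rows < file. Because the field separator is a single tab, leading, trailing, and doubled tabs each produce an empty field that the loop can see.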
