Shell – removing redundancy from output columns

awk, shell-script, text-processing

What is a *NIX way of removing redundancy when I have pairwise comparisons like these in two columns?

    A B
    B A
    A C
    A D
    C A
    D A 
    B C
    C B

A B and B A represent the same comparison, and I would like to remove such redundancy from the dataset. The final result should be:

    A B
    A C
    A D
    B C

Best Answer

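    doit () {
        awk '{
            # order the two fields so "A B" and "B A" produce the same key
            key = ($1 <= $2) ? $1 FS $2 : $2 FS $1
            # print only the first line seen for each key
            if (!seen[key]) print $1, $2
            seen[key] = 1
        }'
    }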

    $ doit < test
    A B
    A C
    A D
    B C
    $

Or, getting terser with it, 'cause Chris Down's answer's so sweet:

    awk '!seen[$1<=$2? $1 FS $2: $2 FS $1]++ {print $1,$2}'

which could be reduced further if you don't care about normalizing the spaces in your data (a bare pattern prints each input line unchanged):

    awk '!seen[$1<=$2? $1 FS $2: $2 FS $1]++'
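To see what the `print $1,$2` actually buys you, here's a small sketch run on input with irregular spacing. The wrapper names `normalized` and `verbatim` are made up for this illustration; they are not part of the answer above.

```shell
# Hypothetical wrapper names, only for this illustration.
normalized() {
    # print $1,$2 rebuilds the line with a single space (OFS) between fields
    awk '!seen[$1<=$2 ? $1 FS $2 : $2 FS $1]++ {print $1, $2}'
}

verbatim() {
    # a bare pattern prints the whole input line unchanged
    awk '!seen[$1<=$2 ? $1 FS $2 : $2 FS $1]++'
}

printf 'A    B\nB A\nA C\n' | normalized   # prints "A B" then "A C"
printf 'A    B\nB A\nA C\n' | verbatim     # prints "A    B" then "A C"
```

Both drop the redundant `B A`; they only differ in whether the surviving lines are re-spaced.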

FS is awk's "field separator" variable, used here to guarantee that the boundary between the two fields is preserved in the key. My original had them run together, $1$2, which, as Stephane Chazelas pointed out, would have treated A BC and AB C as duplicates.
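A quick way to convince yourself of the difference, as a sketch: the two function names below are hypothetical, and `dedupe_joined` is my reconstruction of the run-together `$1$2` key in one-liner form, not the exact original text.

```shell
# Hypothetical helper names; not from the original answer.

# Key built by plain concatenation: "A BC" and "AB C" both become "ABC".
dedupe_joined() {
    awk '!seen[($1 <= $2) ? $1 $2 : $2 $1]++ {print $1, $2}'
}

# Key built with FS between the fields: "A BC" and "AB C" stay distinct.
dedupe_fs() {
    awk '!seen[($1 <= $2) ? $1 FS $2 : $2 FS $1]++ {print $1, $2}'
}

printf 'A BC\nAB C\nBC A\n' | dedupe_joined   # prints only "A BC"
printf 'A BC\nAB C\nBC A\n' | dedupe_fs       # prints "A BC" then "AB C"
```

In both cases `BC A` is correctly dropped as the mirror of `A BC`; only the concatenated key wrongly swallows `AB C` as well.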

Related Solutions

Combine columns from several files into one

    awk 'FNR==1 {print $2}' file*

This prints the second column ($2) of the first line (FNR==1) for every file whose filename starts with file.

An alternative is to print the second column of the first line and then immediately skip to the next file (nextfile is not available in every awk; mawk and GNU awk support it):

    awk '{print $2; nextfile}' file*
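Here's a minimal sketch of the first, more portable form run against two throwaway files. The directory, file contents, and the function name `first_line_col2` are all invented for the example.

```shell
# Throwaway sample files; names and contents are made up for illustration.
tmpdir=$(mktemp -d)
printf 'a1 b1 c1\na2 b2 c2\n' > "$tmpdir/file1"
printf 'x1 y1 z1\nx2 y2 z2\n' > "$tmpdir/file2"

# Hypothetical wrapper: second field of the first line of each file given.
first_line_col2() {
    awk 'FNR==1 {print $2}' "$@"
}

first_line_col2 "$tmpdir"/file*   # prints "b1" then "y1"

rm -rf "$tmpdir"
```

FNR restarts at 1 for every input file, which is why one rule is enough to pick the first line of each.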
