Remove duplicate fields in a given column

awk

I'd like to remove from a given column ($2 in the example) the duplicate fields (comma separated).

Input file:

A    1,2,3,4   
B    4,5,6,3
C    2,15

Expected output:

A    1,2,3,4
B    5,6
C    15

Best Answer

perl -lpe 's/\s\K\S+/join ",", grep {!$seen{$_}++} split ",", $&/e'

You can run the above like so:

$ perl -lpe 's/\s\K\S+/join ",", grep {!$seen{$_}++} split ",", $&/e' afile 
A    1,2,3,4
B    5,6
C    15

How it works

First calling perl with -lpe does the following 3 things.

  • -l[octal] enable line ending processing, specifies line terminator
  • -p assume loop like -n but print line also, like sed
  • -e program one line of program (several -e's allowed, omit programfile)

This essentially take the file in, strips off the newlines, operates on a line, and then tacks a newline character back onto it when it's done. So it's just looping through the file and executing our Perl code against each in turn.

As for the actual Perl code:

  • \s means a spacing character (the five characters [ \f\n\r\t] and \v in newer versions of perl, like [[:space:]]).
  • \K Keep the stuff left of the \K, don't include it in $&
  • \S+ one or more characters not in the set [ \f\n\r\t\v]

The join ",", is going to take the results and rejoin each field so that it's separated by a comma.

The split ",", $& will take the matches that were found by the \S+ and split them into just the fields, without the comma.

The grep {!$seen{$_}++} will take each field's number, add it to the hash, $seen{} where each field's number is $_ as we go through each of them. Each time a field number is "seen" it's counted via the ++ operator, $seen{$_}++.

The grep{!$seen{$_}++} will return a field value if it's only been seen once.

Modified to see what's happening

If you use this modified abomination you can see what's going on as this Perl one liner moves across the lines from the file.

$ perl -lpe 's/\s\K\S+/join ",", grep {!$seen{$_}++} split ",", $&/e; @a=keys %seen; @b=values %seen; print "keys: @a | vals: @b"' afile 
keys: 4 1 3 2 | vals: 1 1 1 1
A    1,2,3,4
keys: 6 4 1 3 2 5 | vals: 1 2 1 2 1 1
B    5,6
keys: 6 4 1 3 2 15 5 | vals: 1 2 1 2 2 1 1
C    15

This is showing you the contents of $seen{} at the end of processing a line from the file. Let's take the 2nd line of the file.

B    4,5,6,3

And here's what my modified version shows that line as:

keys: 6 4 1 3 2 15 5 | vals: 1 2 1 2 2 1 1

So this is saying that we've seen field # 6 (1 time), field # 4 (2 times), etc. and field # 5 (1 time). So when grep{...} returns the results it will only return results from this array if it was present in this line (4,5,6,3) and if we've seen it only 1 time (6,1,15,5). The intersection of these 2 lists is (5,6) and so that's what gets returned by grep.

References

Related Question