Text Processing – How to Remove Duplicated Entries in CSV Fields

awk, csv-simple, perl, text-processing, unicode

How do I remove duplicated entries within each separate field, using the sample below as data?

0x,9.4,,,#0,#UNIX#unix,#cli#L#فا#0#فا#0#L#SE#Cli#SE,#فارسی#فارسی#۱#1#١#1,bsh,#V & v

Expected output (delete all duplicates case-insensitively; Persian and Arabic characters differ at the Unicode level, so they do not count as duplicates of each other; the order of entries, and which duplicate (ignoring case) is kept, do not matter here):

0x,9.4,,,#0,#unix,#cli#L#فا#0#SE,#فارسی#١#۱#1,bsh,#V & v

Entries follow the pattern #x, where x is any string of one or more characters.

See a Unicode table for the differences between the Persian and Arabic alphabets and digits.

Best Answer

Using a shell command line (just a few lines) with a proper CSV parser:

perl -CS -Mopen=":std,IN,OUT,IO,:encoding(utf8)" -MText::CSV -lne '
    BEGIN{
        our $csv = Text::CSV->new({ sep_char => "," });
        # keep the first occurrence of each entry, comparing case-insensitively
        sub uniq { my %seen;  grep !$seen{lc $_}++, @_; }
    };
    $csv->parse($_) or die "parse error";
    # deduplicate the "#"-separated entries within each CSV field
    print join ",", map { join "#", uniq split /#/ } $csv->fields();
' file.csv

Output:

0x,9.4,,,#0,#UNIX,#cli#L#فا#0#SE,#فارسی#۱#1#١,bsh,#V & v

Note:

  • Requires the Text::CSV Perl module: sudo apt-get install libtext-csv-perl on Debian and derivatives.
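Before installing, you can check whether Text::CSV is already available; this quick check (an aside, not part of the original answer) prints the module version if it is present:

```shell
# Print the installed Text::CSV version, or a hint if the module is missing
perl -MText::CSV -e 'print Text::CSV->VERSION, "\n"' 2>/dev/null \
    || echo "Text::CSV not installed"
```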