Text Processing – How to Remove Duplicated Entries in CSV Fields

awk, csv-simple, perl, text-processing, unicode

How do I remove duplicated entries within each separate field, using the sample below as data?

0x,9.4,,,#0,#UNIX#unix,#cli#L#فا#0#فا#0#L#SE#Cli#SE,#فارسی#فارسی#۱#1#١#1,bsh,#V & v

Expected output (delete all duplicates case-insensitively; Persian and Arabic characters differ at the Unicode level, so they do not count as duplicates of each other; the order of entries, and which duplicate (ignoring case) is kept, do not matter here):

0x,9.4,,,#0,#unix,#cli#L#فا#0#SE,#فارسی#١#۱#1,bsh,#V & v

Entries follow the pattern #x, where x is any string of one or more characters.

See a Unicode table for the differences between the Persian and Arabic alphabets and digits.

Best Answer

Using a shell command line (just a few lines) with a proper CSV parser:

perl -CS -Mopen=":std,IN,OUT,IO,:encoding(utf8)" -MText::CSV -lne '
    BEGIN{
        our $csv = Text::CSV->new({ sep_char => "," });
        # keep the first occurrence of each entry, comparing case-insensitively
        sub uniq { my %seen;  grep !$seen{lc $_}++, @_; }
    };
    $csv->parse($_) or die "parse error";
    # deduplicate the "#"-separated entries within each CSV field
    print join ",", map { join "#", uniq split /#/ } $csv->fields();
' file.csv

Output:

0x,9.4,,,#0,#UNIX,#cli#L#فا#0#SE,#فارسی#۱#1#١,bsh,#V & v

Note:

  • Requires the Text::CSV Perl module: sudo apt-get install libtext-csv-perl on Debian and derivatives.
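Before installing, you can check whether Text::CSV is already available; this quick check (an aside, not part of the original answer) prints the module version if it is present:

```shell
# Print the installed Text::CSV version, or a hint if the module is missing
perl -MText::CSV -e 'print Text::CSV->VERSION, "\n"' 2>/dev/null \
    || echo "Text::CSV not installed"
```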