Text Processing – How to Find Non-Repetitive Letter from a Given String

text processing

I have a string, aaabefhhhhhthkkd, from which I need to extract only the letters that occur exactly once, preserving their order.

The string may contain upper case or lower case letters.

Input:

aaabefhhhhhthkkd

Output:

beftd

How should this logic be defined so that I get the required output above?

I tried to use this command but it only partially worked for me:

echo "aaabefhhhhhthkkd" | sed 's/./&\n/g' | uniq

Output of the command above, which only partially works:

a
b
e
f
h
t
h
k
d

Sample strings to test:

String 1: aaabefhhhhhthkkd -> Output -> beftd

String 2: AAAbefhhhhhThkkD -> Output -> befTD

String 3: AAAbefhMThkkD    -> Output -> befMTD

Best Answer

uniq only compares adjacent lines, so if you want to use it, you need to sort your input first. For example:

fold -w1 | sort | uniq -u | paste -sd ''
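  • fold -w1 does the same as your sed 's/./&\n/g', but without introducing an extra spurious newline
  • sort makes duplicate characters adjacent
  • uniq -u prints only the lines that are not repeated; without -u, uniq would keep one copy of every repeated character, which is what your attempt did
  • paste -sd '' joins the result back into a single line (POSIX spells the empty delimiter '\0', in case your paste rejects '')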

Because of the sorting, the output will not always be in the order you want. For example:

$ echo 'AAAbefhMThkkD' | fold -w1 | sort | uniq -u | paste -sd ''
DMTbef
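
If you do want to roll your own and keep the original order, one option is to count the characters first and then walk the string a second time. The following is only a sketch under a few assumptions: the function name singletons is made up for illustration, the comparison is case-sensitive (T and t count as different letters), and the input is a single line. Some awk implementations index bytes rather than characters, so treat it as ASCII-only unless you have checked your awk.

```shell
# Hypothetical helper: print the characters of $1 that occur exactly once,
# in the order in which they appear.
singletons() {
    printf '%s\n' "$1" | awk '{
        n = length($0)
        # First pass: count every character (case-sensitive).
        for (i = 1; i <= n; i++)
            count[substr($0, i, 1)]++
        # Second pass: keep only the characters counted exactly once.
        out = ""
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (count[c] == 1)
                out = out c
        }
        print out
    }'
}

singletons 'aaabefhhhhhthkkd'   # beftd
singletons 'AAAbefhhhhhThkkD'   # befTD
singletons 'AAAbefhMThkkD'      # befMTD
```

No sorting is involved, so the surviving letters come out in their original positions.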

If you don't want to roll your own solution, you could always use the singleton function from Perl's List::MoreUtils module, which keeps the original order:

$ echo 'AAAbefhMThkkD' |
    perl -MList::MoreUtils=singleton -ne 'print singleton split //'
befMTD