text-processing command-line sort – How to Sort Letters in a Word in a Line of Text


So I have a file full of test commands that I like to run against some of my functions to make sure they are handling all possible situations correctly. There's no point in having duplicate commands, though. Here are some examples:

rap ,Xflg MIT X11           
rap ,XPBfl 'MITER'
rap ,Bflg share git-grep    
rap ,bfl X11
rap ,Bfl xzfgrep
rap ,Bf X11

… my function 'rap' uses a comma instead of a dash to indicate the start of letter options, followed by one or more arguments. Since the order of these options doesn't matter:

rap ,Bf X11
rap ,fB X11

… are exactly the same command. Removing duplicate lines from the file is easy, of course, but that misses cases like the above. What I'd like to be able to do is sort the options alphabetically, so that the above would end up as:

rap ,Bf X11
rap ,Bf X11

… and I'd then be able to delete the duplicates. Can something like that be done without heroics? Note this is not sorting 'by' the list of options, but sorting the options themselves.

Best Answer

Another perl variant:

$ perl -pe 's{^rap ,\K\S+}{join "", sort split //, $&}e' file
rap ,Xfgl MIT X11
rap ,BPXfl 'MITER'
rap ,Bfgl share git-grep
rap ,bfl X11
rap ,Bfl xzfgrep
rap ,Bf X11

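For your extra requirement of having lower-case letters before upper-case ones, you can rely on the fact that in ASCII, 'x' is 'X' ^ 32 (and 'X' is 'x' ^ 32). Comparing the character codes XORed with 32 therefore swaps the two cases, so lower-case letters sort first: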

$ perl -pe 's{^rap ,\K\S+}{join "", sort {(ord($a)^32) <=> (ord($b)^32)} split //, $&}e' file
rap ,fglX MIT X11
rap ,flBPX 'MITER'
rap ,fglB share git-grep
rap ,bfl X11
rap ,flB xzfgrep
rap ,fB X11

Related Solutions

Bash – Command line method to find repeat-word typos, with line numbers


You need to take care of at least some edge cases:

Repeated words can be split across a line break (at the end of one line and the beginning of the next).
The search should be case-insensitive, because of frequent errors like "The the apple".
You probably want to restrict the search to word constituents, so as not to match something like "( ( a + b) + c )" (repeated opening parentheses).
Only full words should match, to rule out "the thesis".
For human language, Unicode characters inside words should be interpreted properly.

All in all, I recommend a pcregrep solution:

pcregrep -Min --color=auto '\b([^[:space:]]+)[[:space:]]+\1\b' file

In -Min, M lets a match span more than one line and i makes it case-insensitive. The color and line-number (n) options are optional, but usually nice to have.

Install

On Debian-based distributions you can install via:

$ sudo apt-get install pcregrep

Example

Run the command on jefferson_typo.txt to see:

$ pcregrep -Min --color=auto '\b([^[:space:]]+)[[:space:]]+\1\b' jefferson_typo.txt
1:He has has refused his Assent to Laws, the most wholesome and necessary
3:He has forbidden his Governors to pass Laws of immediate and
and pressing importance, unless suspended in their operation till his
5:Assent should be be obtained; and when so suspended, he has utterly

The above is just a text capture; on a terminal that supports color, the matches are highlighted:

has has
and
and
be be

Sort unix alphabetically then numerically, not working as I intended

sort -k1,1 -nk2 is the same as sort -k1,1 -n -k2, which is the same as sort -n -k1,1 -k2: numeric sorting is turned on globally, for all the keys.

To sort only the second key numerically, add n to that key's description, as in:

sort -k1,1 -k2n

Or:

sort -k1,1 -k2,2n

With n and the default field separator, 2 is the same as 2,2, though. The key 2 covers the part of the line starting from the second field, but when interpreted as a number, that's the same as the second field alone (2,2).
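
The difference between the global and the per-key n can be seen on a small sample. The data below is made up (a chr-prefixed name and a number, in the spirit of the question), and LC_ALL=C is set so that the alphabetical part is compared byte-wise:

```shell
# Hypothetical two-column sample data.
data='chr2 5
chr1 10
chr1 9
chr10 3'

# Global -n: field 1 is also compared as a number. "chrN" has no leading digits,
# so every first key counts as 0 and only the second field decides the order.
global_n=$(printf '%s\n' "$data" | LC_ALL=C sort -k1,1 -nk2)

# Per-key n: field 1 is compared as text, field 2 as a number.
per_key_n=$(printf '%s\n' "$data" | LC_ALL=C sort -k1,1 -k2n)

printf 'global -n:\n%s\n\nper-key n:\n%s\n' "$global_n" "$per_key_n"
```

The first command orders the lines by the number alone (3, 5, 9, 10); the second groups them by name and orders the two chr1 lines as 9, then 10.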

Here, you could also sort numerically on the number that is after chr and then alphabetically on the rest of the first field and then numerically on the second field with:

sort -k1.4n -k1,1 -k2n
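
A quick check of that last command, again on made-up sample data and with LC_ALL=C assumed for a reproducible order:

```shell
# Hypothetical two-column sample data.
data='chr2 5
chr1 10
chr1 9
chr10 3'

# -k1.4n: compare numerically, starting at the 4th character of field 1
# (the digits after "chr"). -k1,1 and -k2n then break ties on the full
# first field and on the second field.
by_number=$(printf '%s\n' "$data" | LC_ALL=C sort -k1.4n -k1,1 -k2n)

printf '%s\n' "$by_number"
```

Here chr2 sorts before chr10, which a purely alphabetical comparison of the first field would not give.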