Bash – Sorting STDIN by length and number of nonblanks in a Bash Script

bash, shell-script, sort

I am working on learning Bash scripting but I am struggling with this problem. Given a bunch of lines from STDIN, sort them first by the length of the line in increasing order. Then, if there are any lines with the same number of characters, sort them by the number of nonblank characters contained in the lines (also in increasing order).

I've tried this a couple of different ways but I usually get caught up in some of the idiosyncrasies of Bash.

Here's what I've got so far:

#!/bin/bash

sorted=()
while IFS='' read -r line; do
    length=${#line}
    if [[ ${sorted[$length]} == "" ]] ; then
        sorted[$length]="$line"
    else
        #non unique length
        #sorted[$length]="${sorted[$length]}\n$line"
        IFS=$'\n' arr=("${sorted[$length]}")
        arr+=("$line")

        spaces=()

        for ((i=0 ; i < ${#arr[@]} ; ++i )) ; do
            spaces[$i]=$(echo "${arr[$i]}" | sed "s: : \n:g" | grep -c " ")
        done

        arr_sorted=()

        for ((i =0 ; i < ${#spaces[@]} ; i++ )) ; do
                for ((j=0 ; j < ${#arr[@]} ; i++ )) ; do

                        this_line_length=$(echo "${arr[$j]}" | sed "s: : \n:g" | grep -c " ")
                        if [[ "$this_line_length" == "${spaces[$i]}" ]] ; then
                            arr_sorted+=("${arr[$j]}")
                            unset arr[$j]
                        fi
                done
        done


    sorted[$length]="${arr_sorted[@]}"


    fi
done

I'm going to go ahead and guess this is nowhere near the best way to do it. I thought I would try to implement everything without relying too heavily on bash builtins but now it seems pretty pointless.

Best Answer

If you're allowed to use evil external contraptions such as sort, cut and tr:

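#!/bin/bash
while IFS= read -r line; do
    # Delete blanks (spaces and tabs) so only nonblank characters are counted.
    squeezed=$( tr -d '[:blank:]' <<<"$line" )
    # Decorate: total length, nonblank count, then the original line.
    printf '%d\t%d\t%s\n' "${#line}" "${#squeezed}" "$line"
done | sort -t $'\t' -k1,1n -k2,2n | cut -f 3-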

Edit: Since everybody's doing it, here's a solution with perl:

perl -e 'print sort { length $a <=> length $b || $a =~ y/ \t//c <=> $b =~ y/ \t//c } <>'
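To make the ordering concrete, here is a minimal sketch of the same decorate / sort / undecorate idea wrapped in a function and run on four sample lines. The function name sort_by_length_then_nonblanks and the sample input are my own, purely for illustration, and I've swapped tr for a bash parameter expansion to count the nonblanks.

```bash
#!/bin/bash
# Hypothetical helper name (not from the answer above).
sort_by_length_then_nonblanks() {
    local line squeezed
    while IFS= read -r line; do
        # Strip spaces and tabs so only nonblank characters remain.
        squeezed=${line//[[:blank:]]/}
        # Decorate: total length, nonblank count, original line (tab-separated).
        printf '%d\t%d\t%s\n' "${#line}" "${#squeezed}" "$line"
    done |
        # Sort numerically on field 1, then on field 2.
        sort -t $'\t' -k1,1n -k2,2n |
        # Undecorate: keep everything from the third field onward.
        cut -f 3-
}

# Sample input: one line of length 2, three lines of length 5.
printf '%s\n' 'abcde' 'a bcd' 'ab' 'a b c' | sort_by_length_then_nonblanks
# Prints:
#   ab
#   a b c
#   a bcd
#   abcde
```

The three five-character lines come out ordered by 3, 4 and 5 nonblank characters. Lines that tie on both counts fall back to sort's ordinary whole-line comparison, which depends on your locale.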

Related Solutions

Sort a file based on length of the column/row

You can first use awk to prepend a column holding the character count of the first column, sort numerically on that count, and then strip the added column:

awk '{printf "%d %s\n", length($1), $0}' file.txt | sort -n -k1,1 | sed -E -e 's/^[0-9]+ //'
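Note that length($1) only measures the first column. If you want to order by the length of the whole row, a sketch along the same lines could use length($0) instead. The function name sort_rows_by_length is made up for this example, and I use a tab as the separator so that cut can remove the added column.

```bash
#!/bin/bash
# Hypothetical helper: sort lines on stdin by total line length.
sort_rows_by_length() {
    # Prepend each row's length and a tab, sort numerically on that
    # first field only, then drop the prepended field again.
    awk '{ printf "%d\t%s\n", length($0), $0 }' |
        sort -t $'\t' -k1,1n |
        cut -f 2-
}

printf '%s\n' 'three words here' 'a' 'two words' | sort_rows_by_length
# Prints:
#   a
#   two words
#   three words here
```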

Shell – Tokenize string from $REPLY in bash script

The shell has a built-in field separator for variable expansions, IFS. So if you have a string and your delimiter is reliable (it never appears inside a field), you can do:

var=32768,'dff0207a-591f-4435-9f8b-7b9b3e6ba2c1','d1f77359b3f7236806489ba3108c771f','NUMBER','US_EN','LATIN','GREEK','GERMAN'
(   IFS=,; set -f
    for field in $var
    do  printf '\n%s\n\t' "$field - md5:" >&2
        printf %s "$field" |
        md5sum
    done |
    cut -d\  -f1
)

32768 - md5:
        f43764367fa4b73ba947fae71b0223a4

dff0207a-591f-4435-9f8b-7b9b3e6ba2c1 - md5:
        0983e6c45209f390461c1b1df9320674

d1f77359b3f7236806489ba3108c771f - md5:
        07d82ab57ba81f991ab996bd7c5a0441

NUMBER - md5:
        34f55eca38e0605a84f169ff61a2a396

US_EN - md5:
        c9d3e580b7b102e864d9aea8703486ab

LATIN - md5:
        0e869135050d24ea6e7a30fc6edbac6c

GREEK - md5:
        d4cacc28e56302bcec9d7af4bba8c9a7

GERMAN - md5:
        ed73cca110623766d7a2457331a4f373

That should give you a newline separated list of md5s - as it did me.

IFS=, specifies that whenever the shell performs field splitting on an unquoted expansion it should split on the , character rather than on <space><tab><newline> - which is the default. set -f specifies that if the shell should encounter any file globs within an unquoted expansion it should not expand them - so echo * would print only * regardless of the contents of the current directory.

For every comma-separated field in $var the shell does printf %s "$field" | md5sum - so once per field, without the separator strings, as I take the question to mean. And last, cut trims the few spaces and the - at the end of each md5sum output line as it receives them. The field labels are actually written to stderr - cut only ever sees the md5sums.
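If you want to see the effect of IFS=, and set -f in isolation, here is a stripped-down sketch without the md5sum part. The function name split_on_commas and the sample string are invented for the demonstration. The subshell keeps both settings from leaking into the rest of the script.

```bash
#!/bin/bash
# Hypothetical helper: print each comma-separated field of $1 on its own line.
split_on_commas() {
    (
        IFS=,    # split unquoted expansions on commas only
        set -f   # disable pathname expansion so a field like * stays literal
        # $1 is deliberately unquoted so that field splitting happens.
        for field in $1; do
            printf '%s\n' "$field"
        done
    )
}

split_on_commas 'NUMBER,US_EN,*,two words'
# Prints:
#   NUMBER
#   US_EN
#   *
#   two words
```

The field "two words" stays in one piece because the space is no longer in IFS, and the * is printed literally rather than being replaced by the names of files in the current directory.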
