How to find the most common name in passwd file

grep, sort, text processing

My /etc/passwd has a list of users in a format that looks like this:

username:password:uid:gid:firstname.lastname, somenumber:/...

Goal: I want to see only the first names and then sort them by frequency, with the most common name first, the second most common second, and so on.

I saw some solutions for the second part, although they apply to working with a text file and not to reading from a map.

As for the first part, I really don't know how to approach it. I know that there are some solutions, but I don't know how to apply them.

Best Answer

One way to do it:

cut -d: -f5 /etc/passwd | \
    sed 's/\..*//' | \
    sort -f | \
    uniq -ci | \
    sort -rn
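To see what such a pipeline produces, here is a minimal sketch run against a small made-up file instead of the real /etc/passwd. The usernames, names, and the helper function name first_name_counts are hypothetical. It also swaps in `tr` to lowercase the names before counting, rather than relying on case-insensitive flags, so it is a variant of the answer and not the answer itself.

```shell
# Hypothetical sample data in the question's format (not a real passwd file).
sample=$(mktemp)
cat > "$sample" <<'EOF'
u1:x:1001:100:john.smith, 111:/home/u1
u2:x:1002:100:mary.jones, 222:/home/u2
u3:x:1003:100:John.doe, 333:/home/u3
u4:x:1004:100:anna.lee, 444:/home/u4
u5:x:1005:100:mary.brown, 555:/home/u5
u6:x:1006:100:john.white, 666:/home/u6
EOF

# Take field 5, drop everything from the first dot, lowercase the result,
# then count identical names and list the most frequent first.
first_name_counts() {
    cut -d: -f5 "$1" |
        sed 's/\..*//' |
        tr '[:upper:]' '[:lower:]' |
        sort |
        uniq -c |
        sort -rn
}

result=$(first_name_counts "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Each output line is a count followed by a name, so on this sample john comes first with 3, then mary with 2, then anna with 1.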

Related Solutions

Keep only certain part of a string in a certain column

Using `sed` and `column`:

$ sed -E 's/ intron_([^:]*):[^[:space:]]*/ \1/' file | column -t
id  target_id    length  eff_length
1   FBgn0000721  1136    243.944268
1   FBgn0000721  1122    240.237419
2   FBgn0264373  56      0

The key part of this is the substitute command:

s/ intron_([^:]*):[^[:space:]]*/ \1/

It looks for intron_ and saves everything after intron_ and before the first colon in capture group 1. The colon is matched literally, and [^[:space:]]* then matches everything after it up to the end of the field. The whole match is replaced by a space followed by the text saved in group 1, which is written as \1.
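The substitution can be tried on a single line. The input below is a hypothetical guess at what one row of the original file might look like, since the answer only shows the output; the variable names are also made up for this sketch.

```shell
# Hypothetical input row; the original input file is not shown in the answer.
line='1 intron_FBgn0000721:4_FBgn0000721:3 1136 243.944268'

# Keep only the text between "intron_" and the first colon of that field.
cleaned=$(printf '%s\n' "$line" | sed -E 's/ intron_([^:]*):[^[:space:]]*/ \1/')
printf '%s\n' "$cleaned"
```

On this input the second field shrinks to FBgn0000721 and the other fields are left untouched.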

Using `awk` with tab-separated output:

$ awk -v "OFS=\t" '{$2=$2;sub(/intron_/, "", $2); sub(/:.*/, "", $2); print}' file
id      target_id       length  eff_length
1       FBgn0000721     1136    243.944268
1       FBgn0000721     1122    240.237419
2       FBgn0264373     56      0

Explanation:

-v "OFS=\t"

This sets the output field separator to a tab. This helps line up the columns, possibly making column unnecessary.

$2=$2

When printing a line, awk does not apply a newly specified output field separator unless a field on the line is modified. Assigning the second field to itself is enough to ensure that the output is tab-separated.

sub(/intron_/, "", $2)

This removes intron_ from the second field.

sub(/:.*/, "", $2)

This removes the first colon and everything after it from the second field.

print

This prints our new line.
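The effect of the field self-assignment is easier to see with a visible separator. This small sketch uses a hyphen as OFS instead of a tab, purely for illustration, and the variable names are hypothetical.

```shell
line='a b c'

# No field is modified, so awk prints the record as it was read.
unchanged=$(printf '%s\n' "$line" | awk -v 'OFS=-' '{ print }')

# Assigning a field to itself makes awk rebuild the record with OFS.
rebuilt=$(printf '%s\n' "$line" | awk -v 'OFS=-' '{ $2 = $2; print }')

printf '%s\n%s\n' "$unchanged" "$rebuilt"
```

The first line comes out as a b c and the second as a-b-c.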

Using `awk` with custom column formatting

This is like the above but uses printf so that we can custom-format column widths and alignments as desired:

$ awk  '{sub(/intron_/, "", $2); sub(/:.*/, "", $2); printf "%-3s %-12s %8s %3s\n",$1,$2,$3,$4}' file
id  target_id      length eff_length
1   FBgn0000721      1136 243.944268
1   FBgn0000721      1122 240.237419
2   FBgn0264373        56   0

Here the statement printf "%-3s %-12s %8s %3s\n",$1,$2,$3,$4 selects column widths and alignments in the usual printf style.
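As a quick sketch of those printf conventions, the following prints three bracketed values so the padding is visible. The widths and sample strings are chosen only for illustration.

```shell
# %-6s pads on the right (left-aligned), %6s pads on the left (right-aligned).
# A width is a minimum: a longer value such as eff_length is not truncated by %3s.
formatted=$(awk 'BEGIN { printf "[%-6s][%6s][%3s]\n", "ab", "cd", "eff_length" }')
printf '%s\n' "$formatted"
```

This is why the eff_length header in the output above overflows its three-character column while the shorter values in that column are padded.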

Using `sed` and converting from tab-separated to comma-separated

$ sed -E 's/ intron_([^:]*):[^[:space:]]*/ \1/; s/[[:space:]][[:space:]]*/,/g' file 
id,target_id,length,eff_length
1,FBgn0000721,1136,243.944268
1,FBgn0000721,1122,240.237419
2,FBgn0264373,56,0

Sort and Uniq in Awk – How to Use

To sort, you can also use a pipe inside an awk command, as in:

awk '{ print ... | "sort ..." }'

With this syntax, every line that is printed is passed to the same instance of sort.

Of course, you can do the same thing at the shell level:

awk '{ print ... }' | sort ...

Or you can use GNU awk, which has built-in sort functions (asort and asorti).

In awk, uniq is typically accomplished by saving the unique data element or key in an associative array and checking whether new data has already been recorded. One example to illustrate:
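The two pipe-based forms can be compared side by side. This is a minimal sketch with made-up input words and variable names; it uses only the portable pipe syntax, not the GNU awk functions.

```shell
words='pear
apple
fig'

# Pipe inside awk: every print feeds the same sort process,
# which emits its output once the pipe is closed.
inside=$(printf '%s\n' "$words" | awk '{ print $1 | "sort" } END { close("sort") }')

# Pipe at the shell level: awk prints, the shell connects it to sort.
outside=$(printf '%s\n' "$words" | awk '{ print $1 }' | sort)

printf '%s\n---\n%s\n' "$inside" "$outside"
```

Both variants print apple, fig, pear in that order.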

awk '!a[$0]++'

This means: if the current line is not yet in the array, the condition is true and the default action, printing the line, is triggered. Subsequent lines with the same data make the condition false and are not printed.
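The same associative-array idea can also count occurrences, which is the awk counterpart of uniq -c. The sketch below shows both on a made-up list of names; the array names seen and count and the shell variable names are arbitrary choices for this example.

```shell
names='john
mary
john
anna
mary
john'

# Keep only the first occurrence of each line, preserving input order.
unique=$(printf '%s\n' "$names" | awk '!seen[$0]++')

# Count occurrences in an associative array, then rank by count.
# The order of "for (k in count)" is unspecified, so sort does the ranking.
ranked=$(printf '%s\n' "$names" |
    awk '{ count[$0]++ } END { for (k in count) print count[k], k }' |
    sort -rn)

printf '%s\n---\n%s\n' "$unique" "$ranked"
```

The deduplicated list keeps john, mary, anna in input order, and the ranked list puts john first with 3, then mary with 2, then anna with 1.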
