Shell – How to get the unique count of a particular part of a string

grep, scripting, shell, sort, text processing

I have a set of data in a file.

psf7433-nlhrms
unit7433-nobody
unit7333-opera
bpx7333-operations
app7333-osm
unit7330-partners
psf7331-pdesmond
unit7333-projm
mnp7330-redirect
unit7333-retailbanking
cpq7333-rkarmer
unit6333-sales
ring7323-support
unit7133-telco
post7323-uadb
sun7335-ukhrms
burp7133-wfnmreply

How can I ignore the leading alphabetic characters on each line, and everything after the number, and get the count of the unique numbers?
(or)
How can I extract only the numeric value from each line and get the count of the unique values?

Considering we manage to extract only the numeric values, we will get this.

Now, I want the unique count of the retrieved numeric values. So ignoring the repetitions, I should get the following final output.

I have not been able to do this with awk, sed, or even a simple grep | cut.

I do not want the list of extracted values, I want only the final count as the answer.

Help me!

Best Answer

With grep, extract just the numbers, then count the distinct ones:

grep -Eo '[0-9]+-' file | sort -u | wc -l

[0-9] matches any single digit (a character between 0 and 9).
+ in extended regular expressions means one or more repetitions of the preceding item (that's why the -E option is used with grep). So [0-9]+- matches one or more digits, followed by -.
-o prints only the part of the line that matched the pattern, so given the input abcd23-gf56, grep prints only 23-. The trailing - appears in every match, so it does not change the count.
sort -u sorts the lines and keeps one copy of each distinct entry, and wc -l counts the lines in its input (hence, the number of unique entries).
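As a quick check, the pipeline can be wrapped in a small function and run on a few lines from the question. This is only a sketch: the function name count_unique_numbers and the scratch file are made up for illustration, and the extra tr strips the padding that some wc implementations print.

```shell
# Hypothetical helper name; the pipeline is the one from the answer above.
# tr -d ' ' removes the leading spaces that some wc versions add.
count_unique_numbers() {
    grep -Eo '[0-9]+-' "$1" | sort -u | wc -l | tr -d ' '
}

# Four lines taken from the question, written to a scratch file.
sample=$(mktemp)
printf '%s\n' psf7433-nlhrms unit7433-nobody unit7333-opera unit7330-partners > "$sample"

count_unique_numbers "$sample"   # prints 3 (7433, 7333, 7330)
rm -f "$sample"
```

Run against the full data set in the question, the same function should report 8.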

Related Solutions

Using Grep -o or Sed/Awk to Grab Snippet from Middle of String

You can do

sed 's/^.*search?q=\([^&]*\)&.*/\1/' file

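This captures everything between search?q= and the next &. sed has no non-greedy quantifier; [^&]* gets the same effect by matching only characters that are not &, so the capture stops at the first & after the query.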

Which outputs

dagger+genesis+solo

If you want to replace the + signs with spaces,

sed 's/^.*search?q=\([^&]*\)&.*/\1/;s/+/ /g' file

Which outputs

dagger genesis solo
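To try the two substitutions without a file, here is a minimal sketch that reads from standard input. The function name extract_query and the URL are invented for illustration; the original input line is not shown above.

```shell
# Hypothetical helper: extract the q= parameter from each line on stdin
# and turn the + separators into spaces.
extract_query() {
    sed 's/^.*search?q=\([^&]*\)&.*/\1/;s/+/ /g'
}

# Made-up URL, not taken from the original question.
printf '%s\n' 'http://www.example.com/search?q=dagger+genesis+solo&source=web' | extract_query
# prints: dagger genesis solo
```

Note that the first substitution only matches when an & follows the query, so a URL where q= is the last parameter would need a different pattern.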

Count unique associated values in awk (or perl)

With awk:

awk 'function p(){print l,c,d; delete a; delete b; c=d=0} 
  NR!=1&&l!=$1{p()} ++a[$2]==1{c++} ++b[$3]==1{d++} {l=$1} END{p()}' file

Explanation:

function p(): defines a function called p(), which prints the values and deletes the used variables and arrays.
NR!=1&&l!=$1 if it is not the first line and the variable l differs from the first field $1, then run the p() function.
++a[$2]==1{c++} if incrementing the element of the a array with index $2 gives 1, then that value is seen for the first time, so increment the c variable. The ++ before the element returns the new value, so the increment happens before the comparison with 1.
++b[$3]==1{d++} the same as above but with the 3rd field and the d variable.
{l=$1} sets l to the first field (for the comparison on the next line, above).
END{p()} after the last line is processed, awk has to print the values for the last block

With your given input the output is:

apple 3 2
banana 4 5
cucumber 2 3
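The same logic may be easier to follow spelled out with longer names. The sketch below is a rewrite under a few assumptions, not the original answer: the names count_per_key, emit, seen2 and seen3 are invented, split("", array) is used to empty the arrays because whole-array delete is not available in every awk, and the END block is guarded so empty input prints nothing. The sample input is made up and, like the one-liner, assumes lines are already grouped by the first field.

```shell
# Hypothetical wrapper: for each run of lines sharing field 1, print the key
# and the number of distinct values seen in field 2 and in field 3.
count_per_key() {
    awk '
        # Print the finished group, then reset its counters and arrays.
        function emit() {
            print last, n2, n3
            split("", seen2); split("", seen3)   # portable way to empty an array
            n2 = n3 = 0
        }
        NR != 1 && last != $1 { emit() }   # key changed: flush the previous group
        ++seen2[$2] == 1 { n2++ }          # first time this field-2 value appears
        ++seen3[$3] == 1 { n3++ }          # first time this field-3 value appears
        { last = $1 }
        END { if (NR) emit() }             # flush the final group
    ' "$@"
}

# Made-up input, not the data from the original question.
printf '%s\n' 'apple red small' 'apple red large' 'apple green small' 'banana yellow long' | count_per_key
# prints:
# apple 2 2
# banana 1 1
```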
