Parsing Log Files with sed -e. Need to count unique class names

regular expression, sed, text processing

I have a file, let's call it filename.log, that contains lines like this:

(2014-11-18 14:09:21,766), , xxxxxx.local, EventSystem, DEBUG FtpsFile delay secs is 5 [pool-3-thread-7] 
(2014-11-18 14:09:21,781), , xxxxxx.local, EventSystem, DEBUG FtpsFile disconnected from ftp server [pool-3-thread-7] 
(2014-11-18 14:09:21,798), , xxxxxx.local, EventSystem, DEBUG FtpsFile FTP File  Process@serverStatus on exit  - 113 [pool-3-thread-7] 
(2014-11-18 14:09:21,798), , xxxxxx.local, EventSystem, DEBUG FtpsFile FTP File  Process@serverStatus on exit  - 114 [pool-3-thread-7] 
(2014-11-18 14:09:21,799), , xxxxxx.local, EventSystem, DEBUG JobQueue $_Runnable Finally of consume() :: [pool-3-thread-7] 

I am trying to find the classes that produce the most DEBUG messages.

In this example you can see FtpsFile and JobQueue are two of the classes producing a message.

I have this

cat filename.log | sed -n -e 's/^.*\(DEBUG \)/\1/p' | sort | uniq -c | sort -rn | head -10

This will produce the class name and show me the most frequent classes as a top 10.

The problem is that this does not give me a count of 4 for the class FtpsFile. It counts each distinct FtpsFile log line as a separate entry.

How do I change the command above so that it grabs only the first word after DEBUG and ignores the rest of the line when counting?

Ideally I should get
4 FtpsFile
1 JobQueue

Best Answer

  • With GNU sed:

    sed -n 's/.*DEBUG \(\w*\).*/\1/p' filename.log | sort | uniq -c | sort -rn
          4 FtpsFile
          1 JobQueue
    
  • With GNU grep (needs -P, i.e. PCRE support):

    grep -Po 'DEBUG \K\w+' filename.log | sort | uniq -c | sort -rn
          4 FtpsFile
          1 JobQueue
    
  • With awk:

    awk '$6=="DEBUG"{print $7}' filename.log | sort | uniq -c | sort -rn
          4 FtpsFile
          1 JobQueue
    

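The last one can be done in pure awk, but for the sake of consistency I piped it to uniq. Keep in mind that uniq -c only counts adjacent duplicate lines, so the class names have to be sorted before they are counted whenever messages from different classes are interleaved in the log.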
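For reference, here is a minimal sketch of what doing the counting inside awk could look like. It assumes, as in the sample lines, that DEBUG is the sixth whitespace-separated field and the class name is the seventh. The function name count_debug_classes and the temporary sample file are hypothetical, made up for this illustration. awk does not guarantee any order when looping over an array, so the result is still piped through sort -rn for ranking.

```shell
# Hypothetical helper (name is an assumption, not part of the original answer).
# Counts DEBUG messages per class inside awk, then ranks them with sort.
count_debug_classes() {
    # $6 is the log level and $7 the class name when splitting on whitespace.
    awk '$6 == "DEBUG" { count[$7]++ }
         END { for (cls in count) print count[cls], cls }' "$1" | sort -rn
}

# Sample data copied from the question, written to a temporary file.
log=$(mktemp)
cat > "$log" <<'EOF'
(2014-11-18 14:09:21,766), , xxxxxx.local, EventSystem, DEBUG FtpsFile delay secs is 5 [pool-3-thread-7]
(2014-11-18 14:09:21,781), , xxxxxx.local, EventSystem, DEBUG FtpsFile disconnected from ftp server [pool-3-thread-7]
(2014-11-18 14:09:21,798), , xxxxxx.local, EventSystem, DEBUG FtpsFile FTP File  Process@serverStatus on exit  - 113 [pool-3-thread-7]
(2014-11-18 14:09:21,798), , xxxxxx.local, EventSystem, DEBUG FtpsFile FTP File  Process@serverStatus on exit  - 114 [pool-3-thread-7]
(2014-11-18 14:09:21,799), , xxxxxx.local, EventSystem, DEBUG JobQueue $_Runnable Finally of consume() :: [pool-3-thread-7]
EOF

# Top 10 classes by number of DEBUG messages.
count_debug_classes "$log" | head -10
```

Unlike uniq -c, this prints the counts without leading padding, and it needs no sort before counting because the array accumulates totals regardless of line order.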
