Linux – Counting all occurrences of a string within all files in a folder

grep linux

How do I count all occurrences of a particular string in all files in a folder?

For example,

1.txt: 'hahaha hehe'
2.txt: 'ha hee'

I would like to count all occurrences of 'ha' in these files.

I tried:

grep "ha" * | wc -l

But that only counts the lines that contain a match, not every occurrence.

Best Answer

You're close. To get a total count of all occurrences of "ha" within all .txt files in a folder:

grep -o "ha" *.txt | wc -l

From man grep:

-o, --only-matching
       Print only the matched (non-empty) parts of a matching line, with
       each such part on a separate output line.

This works because each match is printed on a separate line, thus allowing wc -l to count all of them.

By default, however, grep prints each matching line once, no matter how many times the pattern occurs on it. Likewise, option -c counts matching lines rather than individual matches, so a line with several occurrences still adds only 1 to the count.
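The difference between counting lines and counting matches can be checked with the two example files from the question. This is a minimal sketch: the scratch directory and the variable names (demo_dir, lines_per_file, total) are made up for the demonstration and are not part of the original answer.

```shell
# Recreate the question's example files in a throwaway directory.
# demo_dir is a hypothetical name used only for this demonstration.
demo_dir=$(mktemp -d)
printf 'hahaha hehe\n' > "$demo_dir/1.txt"
printf 'ha hee\n' > "$demo_dir/2.txt"

# -c counts matching LINES per file; -h drops the file name prefix.
lines_per_file=$(grep -hc "ha" "$demo_dir"/*.txt)

# -o prints each match on its own line, so wc -l counts occurrences.
total=$(grep -o "ha" "$demo_dir"/*.txt | wc -l)

echo "matching lines per file:"
echo "$lines_per_file"
echo "total occurrences: $total"

# Clean up the scratch directory.
rm -r "$demo_dir"
```

With these two files, -c should report one matching line per file, while the -o pipeline should report 4 occurrences in total (3 in 1.txt and 1 in 2.txt).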

EDIT:

Here is a way to print the total number of occurrences within each individual file (with filenames):

find *.txt -printf 'echo "$(grep -o "ha" %p | wc -l) %p";' | sh

#Example output
3 file1.txt
1 file2.txt

Explanation:

find *.txt - the shell expands *.txt to the matching file names, and find treats each one as a starting point

-printf - prints everything between the single-quotes (formatted) to standard output, replacing occurrences of %p with find's output (file names)

$(grep -o "ha" %p | wc -l) - works as above

| sh - the output from -printf (a series of commands) is piped to a shell and executed

Note that -printf is evaluated once per file name, so one echo command is generated for each file.
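Because the find ... | sh approach builds shell commands out of file names, a name containing spaces or quotes can break it. A plain loop is a possible alternative, not the answer's own method. In this sketch the directory, the file names, and the variables (demo_dir, report, count) are invented for the demonstration.

```shell
# Scratch directory with two sample files; one name contains a space
# to show that quoting "$f" keeps such names intact.
demo_dir=$(mktemp -d)
printf 'hahaha hehe\n' > "$demo_dir/file1.txt"
printf 'ha hee\n' > "$demo_dir/file 2.txt"

# For every .txt file, count the matches and print "<count> <name>".
report=$(
  for f in "$demo_dir"/*.txt; do
    count=$(grep -o "ha" "$f" | wc -l)
    # ${f##*/} strips the directory part, leaving only the file name.
    printf '%d %s\n' "$count" "${f##*/}"
  done
)
echo "$report"

# Clean up the scratch directory.
rm -r "$demo_dir"
```

The loop prints one line per file in the same "count name" layout as the example output above, and no file name is ever passed through a second shell.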

Related Solutions

Linux – Using grep to find a string in files

grep -c string a*

Bash – Efficient way to search string within file find and grep

The fastest I can come up with is to use xargs to share the load:

find . -type f -print0  | xargs -0 grep -Fil "mypattern"

Running some benchmarks on a directory containing 3631 files:

$ time find . -type f -exec grep -l -i "mystring" {} 2>/dev/null \;

real    0m15.012s
user    0m4.876s
sys     0m1.876s

$ time find . -type f -exec grep -Fli "mystring" {} 2>/dev/null \;

real    0m13.982s
user    0m4.328s
sys     0m1.592s


$ time find . -type f -print0  | xargs -0 grep -Fil "mystring" >/dev/null 

real    0m3.565s
user    0m3.508s
sys     0m0.052s
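The xargs version is faster mainly because grep is started a few times for many files instead of once per file. find's own -exec ... {} + form batches file names in the same way. It was not part of the benchmarks above, so no timing is claimed for it. The sketch below only checks that both forms report the same files; the directory and file names are made up for the demonstration.

```shell
# Scratch directory with one file that matches and one that does not.
demo_dir=$(mktemp -d)
printf 'contains MyString here\n' > "$demo_dir/a.txt"
printf 'nothing relevant\n' > "$demo_dir/b.txt"

# Batching via xargs, as in the answer above.
via_xargs=$(find "$demo_dir" -type f -print0 | xargs -0 grep -Fil "mystring" | sort)

# Batching via find itself: "{} +" appends many file names to one grep call.
via_plus=$(find "$demo_dir" -type f -exec grep -Fil "mystring" {} + | sort)

echo "xargs:  $via_xargs"
echo "exec +: $via_plus"

# Clean up the scratch directory.
rm -r "$demo_dir"
```

Both commands should list only a.txt, since -i makes the search case-insensitive and -l prints just the names of matching files.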

Your other options would be to streamline either by limiting the file list using find:

   -executable
          Matches files which are executable and directories
          which are searchable (in a file name resolution sense).
   -writable
          Matches files which are writable.             

   -mtime n
          File's  data was last modified n*24 hours ago.
          See the comments for -atime to understand  how
          rounding  affects  the  interpretation of file
          modification times.
   -group gname
          File  belongs to group gname (numeric group ID
          allowed).
   -perm /mode
          Any  of  the  permission bits mode are set for
          the file.  Symbolic modes are accepted in this
          form.  You must specify `u', `g' or `o' if you
          use a symbolic mode. 
   -size n[cwbkMG]  <-- you can set a minimum or maximum size
          File uses n units  of  space.

Or by tweaking grep:

You are already using grep's -l option, which causes the file name to be printed and, more importantly, stops at the first match:

   -l, --files-with-matches
       Suppress normal output; instead print the name of each input file  from
       which  output would normally have been printed.  The scanning will stop
       on the first match.  (-l is specified by POSIX.)

The only other thing I can think of to speed things up would be to make sure your pattern is not interpreted as a regex (as suggested by @suspectus) by using the -F option.
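Besides speed, -F also changes what matches, because regex metacharacters in the pattern are taken literally. A small sketch with invented sample data (demo_file and its contents are assumptions made for the demonstration):

```shell
# Sample file: only the first line contains the literal text "1.5".
demo_file=$(mktemp)
printf 'version 1.5\nversion 125\n' > "$demo_file"

# As a regex, "." matches any character, so "125" matches as well.
regex_hits=$(grep -c "1.5" "$demo_file")

# With -F the pattern is a fixed string, so only "1.5" itself matches.
fixed_hits=$(grep -Fc "1.5" "$demo_file")

echo "regex: $regex_hits matching lines, fixed: $fixed_hits matching lines"

# Clean up the sample file.
rm "$demo_file"
```

Here the regex search should count 2 matching lines and the -F search only 1, so -F is worth using whenever the search term is meant literally.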
