With find:
cd /the/dir
find . -type f -exec grep pattern {} +
(-type f restricts the search to regular files, which also excludes symlinks even if they point to regular files. If you want to search in any type of file except directories (but beware that there are some types of files, like fifos or /dev/zero, that you generally don't want to read), replace -type f with the GNU-specific ! -xtype d (-xtype d matches files of type directory after symlink resolution)).
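The difference between the two tests can be seen in a small scratch directory. This is only a sketch: the directory comes from mktemp, the file names are made up for the demo, and -xtype assumes GNU find.

```shell
# Sketch: compare -type f with the GNU-specific ! -xtype d.
# The scratch directory and all file names are hypothetical.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

printf 'pattern here\n' > regular.txt
mkdir sub
printf 'pattern there\n' > sub/nested.txt
ln -s regular.txt link-to-file   # symlink to a regular file
ln -s sub link-to-dir            # symlink to a directory

# Regular files only: both symlinks are skipped.
type_f=$(find . -type f | LC_ALL=C sort)

# Anything that is not a directory after symlink resolution:
# link-to-file is listed, link-to-dir is not.
not_xtype_d=$(find . ! -xtype d | LC_ALL=C sort)

printf '%s\n' '-type f:' "$type_f" '! -xtype d:' "$not_xtype_d"

cd / && rm -rf "$dir"
```

With -type f, grep never sees link-to-file; with ! -xtype d it does, which is the trade-off described above.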
With GNU grep:
grep -r pattern /the/dir
(but beware that unless you have a recent version of GNU grep, that will follow symlinks when descending into directories). Non-regular files won't be searched unless you add a -D read option. Recent versions of GNU grep will still not search inside symlinks though.
Very old versions of GNU find did not support the standard {} + syntax, but there you could use the non-standard:
cd /the/dir &&
find . -type f -print0 | xargs -r0 grep pattern
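As a rough illustration of why the NUL-delimited form is used, the sketch below runs the same pipeline over a scratch directory containing a file name with spaces. The directory and file names are made up, and grep -l is used only to list the matching files.

```shell
# Sketch: -print0 with xargs -0 passes each file name intact,
# even when it contains spaces. All names here are hypothetical.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

printf 'pattern\n' > 'name with spaces.txt'
printf 'pattern\n' > plain.txt

hits=$(find . -type f -print0 | xargs -r0 grep -l pattern | LC_ALL=C sort)
printf '%s\n' "$hits"

cd / && rm -rf "$dir"
```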
Performance is likely to be I/O bound. That is, the time to do the search would be the time needed to read all that data from storage.
If the data is on a redundant disk array, reading several files at a time might improve performance (and could degrade it otherwise). If performance is not I/O bound (because, for instance, all the data is in cache) and you have multiple CPUs, concurrent greps might help as well. You can do that with GNU xargs's -P option.
For instance, if the data is on a RAID1 array with 3 drives, or if the data is in cache and you have 3 CPUs with time to spare:
cd /the/dir &&
find . -type f -print0 | xargs -n1000 -r0P3 grep pattern
(here using -n1000 to spawn a new grep every 1000 files, up to 3 running in parallel at a time).
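To see how -n groups the arguments, here is a minimal sketch with 5 made-up names and a batch size of 2. echo stands in for grep so the grouping is visible, and -P is left out so the output order stays deterministic.

```shell
# Sketch: 5 NUL-delimited names, at most 2 per command invocation.
# echo is a stand-in for grep; the names a..e are hypothetical.
batches=$(printf '%s\0' a b c d e | xargs -r0 -n2 echo)
printf '%s\n' "$batches"
```

Each output line corresponds to one invocation of the command, so 5 names with -n2 give 3 invocations. Adding -P3 would let up to 3 of those run at the same time.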
However, note that if the output of grep is redirected, you'll end up with badly interleaved output from the 3 grep processes, in which case you may want to run it as:
find . -type f -print0 | stdbuf -oL xargs -n1000 -r0P3 grep pattern
(on a recent GNU or FreeBSD system) or use the --line-buffered option of GNU grep.
If pattern is a fixed string, adding the -F option could improve matters.
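Keep in mind that -F also changes what the pattern means: it is matched as a literal string rather than as a regular expression. A minimal sketch, using two made-up sample lines:

```shell
# Sketch: -F treats the pattern as a literal string, not a regex.
# The sample lines are hypothetical.
sample=$(printf '%s\n' 'abc' 'a.c')

# Without -F, the dot matches any character, so both lines match.
regex_hits=$(printf '%s\n' "$sample" | grep -c 'a.c')

# With -F, only the line containing the literal a.c matches.
fixed_hits=$(printf '%s\n' "$sample" | grep -cF 'a.c')

printf 'regex: %s, fixed: %s\n' "$regex_hits" "$fixed_hits"
```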
If the data is not multi-byte character data, or if it doesn't matter for the matching of that pattern whether the data is multi-byte character data or not, then:
cd /the/dir &&
LC_ALL=C grep -r pattern .
could improve performance significantly.
If you end up doing such searches often, then you may want to index your data using one of the many search engines out there.
Best Answer
It's not the shortest possible match, just a short match. Greedy mode tries to find the last possible match, lazy mode the first possible match. But the first possible match is not necessarily the shortest one.
Take the input string foobarbaz and the regexp o.*a (greedy) or o.*?a (lazy).
The shortest possible match in this input string would be oba.
However, the RegExp looks for matches from left to right, so the o finds the first o in foobarbaz. And if the rest of the pattern produces a match, that's where it stays.
Following the first o, .* (greedy) eats obarbaz (the entire rest of the string) and then backtracks in order to match the rest of the pattern (a). Thus it finds the last a, the one in baz, and ends up matching oobarba.
Following the first o, .*? (lazy) doesn't eat the entire string; instead it looks for the first occurrence of the rest of the pattern. So first it sees the second o, which doesn't match a, then it sees b, which doesn't match a, then it sees a, which matches a, and because it's lazy that's where it stops. (The result is ooba, not oba.)
So while it's not THE shortest possible one, it's a shorter one than the greedy version.
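The two results can be checked from a shell. This is a sketch that assumes a GNU grep built with PCRE support (the -P option), since lazy quantifiers are not part of basic or extended regular expressions; -o prints only the matched text.

```shell
# Sketch: greedy vs lazy quantifier on the example string.
# Assumes GNU grep with -P (PCRE) available.
greedy=$(printf 'foobarbaz\n' | grep -oP 'o.*a')
lazy=$(printf 'foobarbaz\n' | grep -oP 'o.*?a')

printf 'greedy: %s\nlazy:   %s\n' "$greedy" "$lazy"
```

Both matches start at the first o; only the end point differs, which is why the lazy result is ooba rather than oba.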