Bash – How to search many files with a regular expression and output each match on its own line in a new file

Tags: bash, regular expression, search

I have thousands of source files, and I would like to find all text that matches a regular expression and then output each match on its own line in a resulting text file.

For instance:

// a.cs
string test = _.Text("Hello World!") + _.Text("Foo");
// b.cs
Debug.Log(_.ActionText("Bar"));

// results.txt
_.Text("Hello World")
_.Text("Foo")
_.ActionText("Bar")

Which command is capable of achieving this? Could you please show an example?

Best Answer

sed '/\n/P;//!s/_\.[^ ("]*Text([^)]*)/\n&\n/;D' files... >results.txt

...would probably work. Run on your example data, it prints:

_.Text("Hello World!")
_.Text("Foo")
_.ActionText("Bar")

All it does is attempt to enclose the first match on a line in \newlines. Whether or not that succeeds, it Deletes up to the first \newline in pattern space - which, for a non-matching line, removes it from output completely, but for a match deletes only up to the head of your pattern, and the script starts over from the top. If a \newline is then matched in pattern space - which can only happen when a match was just found and its head Deleted - sed Prints only up to the first occurring \newline, which sits at the tail of your matched string. The s///ubstitution is !not attempted while a \newline is already in pattern space, so the next Delete clears the just-printed match and the cycle starts again from the tail of the last match onward.
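
Written out one command per line - with comments that just annotate the same script, and \n escapes that assume GNU sed - it looks like this:

sed '
# a \newline in pattern space means a match was just isolated: Print up to it
/\n/P
# otherwise wrap the first match on the line in \newlines
//!s/_\.[^ ("]*Text([^)]*)/\n&\n/
# Delete through the first \newline (or the whole non-matching line) and go again
D
' files... >results.txt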

Depending on your sed you may need to use a literal \newline in place of the n in the right-hand substitution field, though (there is a sketch of that further down). But you should be able to do all of the file arguments at once - or, at least, very many at a time (depending on your ARG_MAX limit). You can just shell glob for those, or maybe do...

find /path -name pattern -exec sed script_above {} + >>results.txt

...because sed will treat all input files as a single stream.
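
If your sed does not understand \n on the right-hand side of the s///ubstitution (BSD/macOS sed, for example), the same script can be written with backslash-escaped literal \newlines instead - here sketched over a plain shell glob, where the ./*.cs glob is only an example:

sed '/\n/P;//!s/_\.[^ ("]*Text([^)]*)/\
&\
/;D' ./*.cs >results.txt

And if GNU grep happens to be available, its -o option already prints every match on its own line, so the whole extraction can also be done with it as a simpler alternative to the sed script:

grep -roh --include='*.cs' '_\.[^ ("]*Text([^)]*)' /path >results.txt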
