How to find all files containing various strings from a long list of string combinations

awk, grep, osx, text processing

I am still very new to command line tools (using my Mac OSX terminal) and hope I haven't missed the answer somewhere else, but I have searched for hours.

I have a text file (let's call it strings.txt) containing 200 combinations of 3 strings. [Edit 2017/01/30] The first five rows look like this:

"surveillance data" "surveillance technology" "cctv camera"
"social media" "surveillance techniques" "enforcement agencies"
"social control" "surveillance camera" "social security"
"surveillance data" "security guards" "social networking"
"surveillance mechanisms" "cctv surveillance" "contemporary surveillance"

Note that I can change strings.txt to any other format, as long as the bigrams (2-word phrases such as surveillance data in line 1) stay together. That means I can delete the quotes if necessary, as required for the answer by @MichaelVehrs below.

Now I want to search a directory of more than 800 files for those files that contain at least one of the string combinations (anywhere in the file). My original idea was to use egrep with a pattern file like this:

egrep -i -l -r -f strings.txt file_directory

However, I can only get this to work if there is one string per line. This is not desirable, because I need the identified files to contain all three strings of a given pattern. Is there a way to add some kind of AND operator to the grep pattern file? Or is there another way to achieve what I want using another function/tool? Many thanks!

Edit 2017/01/30

The answer by @MichaelVehrs below was very helpful; I edited it to the following:

while read one two three four five six
do grep -ilFr "$one $two" *files* | xargs grep -ilFr "$three $four" |  xargs grep -ilFr "$five $six"
done < *patternfile* | sort -u

This answer works when the pattern file contains the strings without quotes. Sadly, it only seems to match the pattern on the first line of the pattern file. Does anyone know why?

Edit 2017/01/29

A similar question about grepping multiple values has been asked before, but I need the AND logic in order to match one of the three-string-combinations from the pattern file strings.txt in the other files. I realise that the format of strings.txt might have to be changed for the matching to work and would appreciate suggestions.

Best Answer

Since agrep does not seem to be present on your system, have a look at this alternative based on sed and awk, which applies a grep-like AND operation using patterns read from a local file.

PS: Since you use OS X, I'm not sure whether the awk version you have will support the usage below.

awk can simulate grep with an AND operation over multiple patterns like this:
awk '/pattern1/ && /pattern2/ && /pattern3/'
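
A minimal sketch of that AND behaviour, assuming a throwaway file called sample.txt (a hypothetical name) in a temporary directory. Note that awk tests each line separately, so a line is printed only when all three patterns occur on that same line:

```shell
# Create a scratch directory with a small sample file (names are illustrative).
tmpdir=$(mktemp -d)
printf '%s\n' \
  'surveillance data only' \
  'surveillance data, surveillance technology and a cctv camera' \
  'cctv camera only' > "$tmpdir/sample.txt"

# Print only the lines that match all three patterns.
result=$(awk '/surveillance data/ && /surveillance technology/ && /cctv camera/' "$tmpdir/sample.txt")
printf '%s\n' "$result"

# Clean up the scratch directory.
rm -r "$tmpdir"
```

Only the second line of sample.txt contains all three phrases, so it is the only one printed.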

So you could transform your pattern file from this:

$ cat ./tmp/d1.txt
"surveillance data" "surveillance technology" "cctv camera"
"social media" "surveillance techniques" "enforcement agencies"
"social control" "surveillance camera" "social security"
"surveillance data" "security guards" "social networking"
"surveillance mechanisms" "cctv surveillance" "contemporary surveillance"

To this:

$ sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' ./tmp/d1.txt
/surveillance data/ && /surveillance technology/ && /cctv camera/
/social media/ && /surveillance techniques/ && /enforcement agencies/
/social control/ && /surveillance camera/ && /social security/
/surveillance data/ && /security guards/ && /social networking/
/surveillance mechanisms/ && /cctv surveillance/ && /contemporary surveillance/

PS: You can redirect the output to another file by appending >anotherfile to the command, or you can use the sed -i option to change the pattern file in place.
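
As a sketch of the redirect option, assuming hypothetical file names strings.txt (the quoted phrases) and patterns.awk (the converted output), the conversion can be written to a separate file so the original stays intact. This variant uses | as the sed delimiter, which is my substitution for readability, not part of the command above. It also assumes the phrases contain no regex metacharacters or slashes, which would need escaping:

```shell
# Scratch setup; strings.txt and patterns.awk are illustrative names.
tmpdir=$(mktemp -d)
printf '%s\n' '"social media" "surveillance techniques" "enforcement agencies"' > "$tmpdir/strings.txt"

# Same three substitutions as above, using | as the sed delimiter so the
# slashes in the replacement need no escaping; strings.txt is left untouched.
sed 's|" "|/ \&\& /|g; s|^"|/|; s|"$|/|' "$tmpdir/strings.txt" > "$tmpdir/patterns.awk"

result=$(cat "$tmpdir/patterns.awk")
printf '%s\n' "$result"

# Clean up the scratch directory.
rm -r "$tmpdir"
```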

Then you just need to feed awk the awk-formatted patterns from this pattern file:

$ while IFS= read -r line;do awk "$line" *.txt;done<./tmp/d1.txt #d1.txt = my test pattern file

Alternatively, you can leave the original pattern file unchanged and apply sed to each line as it is read, like this:

while IFS= read -r line;do 
  line=$(sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' <<<"$line")
  awk "$line" *.txt
done <./tmp/d1.txt

Or as one-liner:

$ while IFS= read -r line;do line=$(sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' <<<"$line"); awk "$line" *.txt;done <./tmp/d1.txt

The above commands return the correct AND results for my test files, which look like this:

$ cat d2.txt
This guys over there have the required surveillance technology to do the job.
The other guys not only have efficient surveillance technology, but they also gather surveillance data by one cctv camera.

$ cat d3.txt
All surveillance data are locked.
All surveillance data are locked and guarded by security guards.
There are several surveillance mechanisms (i.e cctv surveillance, contemporary surveillance, etv)

Results:

$ while IFS= read -r line;do awk "$line" *.txt;done<./tmp/d1.txt
#or while IFS= read -r line;do line=$(sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' <<<"$line"); awk "$line" *.txt;done <./tmp/d1.txt
The other guys not only have efficient surveillance technology, but they also gather surveillance data by one cctv camera.
There are several surveillance mechanisms (i.e cctv surveillance, contemporary surveillance, etv)

Update:
The above awk solution prints the matching lines from the txt files.
If you want to display the filenames instead of the matching lines, use the following awk command where necessary:

awk "$line""{print FILENAME}" *.txt
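
The filename-printing form emits a file's name once per matching line, so a file with several matching lines is listed repeatedly. A minimal sketch of collapsing the duplicates with sort -u, assuming an already converted pattern file called patterns.awk and two test files a.txt and b.txt (all hypothetical names):

```shell
# Scratch setup; all file names here are illustrative.
tmpdir=$(mktemp -d)
printf '%s\n' '/surveillance data/ && /security guards/' > "$tmpdir/patterns.awk"
printf '%s\n' 'All surveillance data are guarded by security guards.' \
              'More surveillance data, more security guards.' > "$tmpdir/a.txt"
printf '%s\n' 'Nothing relevant here.' > "$tmpdir/b.txt"

# For each awk-formatted pattern line, print the name of every file that has
# a matching line, then collapse repeated names with sort -u.
result=$(
  while IFS= read -r line; do
    awk "$line"'{print FILENAME}' "$tmpdir"/*.txt
  done < "$tmpdir/patterns.awk" | sort -u
)
printf '%s\n' "$result"

# Clean up the scratch directory.
rm -r "$tmpdir"
```

Here a.txt has two matching lines but is reported once, and b.txt is not reported at all. The match is still made line by line, so a file whose three phrases sit on different lines would not be listed.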