This answer is based on the awk answer posted by potong. It is twice as fast as the comm method (on my system), for the same 6 million lines in the main file and 10 thousand keys... (now updated to use FNR and NR)
Although awk is faster than your current system, and will give you and your computer(s) some breathing space, be aware that when data processing is as intense as you've described, you will get the best overall results by switching to a dedicated database, e.g. SQLite or MySQL.
awk '{ if (/^[^0-9]/) { next }               # Skip lines which do not hold key values
       if (FNR==NR) { main[$0]=1 }           # Process keys from file "mainfile"
       else if (!($0 in main)) { keys[$0]=1 } # Process keys from file "keys"
     } END { for (key in keys) print key }' \
   "mainfile" "keys" >"keys.not-in-main"
# For 6 million lines in "mainfile" and 10 thousand keys in "keys"
# The awk method
# time:
# real 0m14.495s
# user 0m14.457s
# sys 0m0.044s
# The comm method
# time:
# real 0m27.976s
# user 0m28.046s
# sys 0m0.104s
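For comparison, the comm method benchmarked above is not shown in the original; a minimal sketch of it might look like this (assuming both files are plain one-key-per-line lists; comm requires sorted input, and the process substitutions need bash):

```shell
#!/usr/bin/env bash
# Sample stand-ins for the 6-million-line "mainfile" and the "keys" file
printf 'key1\nkey2\nkey3\n' > mainfile
printf 'key2\nkey9\n'       > keys
# comm -13: -1 drops lines unique to mainfile, -3 drops lines common
# to both, leaving only keys that are NOT in mainfile
comm -13 <(sort mainfile) <(sort keys) > keys.not-in-main
cat keys.not-in-main    # -> key9
```

The sort calls are what make this slower than the awk hash-lookup approach on large inputs.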
Yes, find ./work -print0 | xargs -0 rm will execute something like rm ./work/a './work/b c' .... You can check with echo: find ./work -print0 | xargs -0 echo rm will print the command that will be executed (except that whitespace will be escaped appropriately, though the echo won't show that).
To get xargs to put the names in the middle, you need to add -I[string], where [string] is what you want to be replaced with the argument; in this case you'd use -I{}, e.g. <strings.txt xargs -I{} grep {} directory/*.
What you actually want to use is grep -F -f strings.txt:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified by
POSIX.)
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)
So grep -Ff strings.txt subdirectory/* will find all occurrences of any string in strings.txt as a literal; if you drop the -F option, you can use regular expressions in the file. You could actually use grep -F "$(<strings.txt)" directory/* too. If you want to practice find, you can use the last two examples in the summary. If you want to do a recursive search instead of just the first level, you have a few options, also in the summary.
Summary:
# grep for each string individually.
<strings.txt xargs -I{} grep {} directory/*
# grep once for everything
grep -Ff strings.txt subdirectory/*
grep -F "$(<strings.txt)" directory/*
# Same, using find
find subdirectory -maxdepth 1 -type f -exec grep -Ff strings.txt {} +
find subdirectory -maxdepth 1 -type f -print0 | xargs -0 grep -Ff strings.txt
# Recursively
grep -rFf strings.txt subdirectory
find subdirectory -type f -exec grep -Ff strings.txt {} +
find subdirectory -type f -print0 | xargs -0 grep -Ff strings.txt
You may want to use the -l option to get just the name of each matching file if you don't need to see the actual line:
-l, --files-with-matches
Suppress normal output; instead print the name of each input
file from which output would normally have been printed. The
scanning will stop on the first match. (-l is specified by
POSIX.)
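A minimal sketch of -l combined with -Ff (the file names strings.txt, match.txt, and nomatch.txt are made up for the demo):

```shell
# Sample data: one file that contains a listed string, one that doesn't
printf 'needle\n'          > strings.txt
printf 'hay needle hay\n'  > match.txt
printf 'just hay\n'        > nomatch.txt
# -l prints only the names of files containing a match, not the lines
grep -lFf strings.txt match.txt nomatch.txt   # -> match.txt
```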
You could use grep -o to print only the matching part and use the result as patterns for a second grep -v on the original patterns.txt file. Though in this particular case you could also use join + sort.
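The code for those two approaches is not shown above; a sketch of how they might look, assuming patterns.txt holds one fixed string per line and input.txt is the file being searched (the goal being to list the patterns that never matched):

```shell
#!/usr/bin/env bash
# Sample data (file names and contents are made up for the demo)
printf 'alpha\nbeta\ngamma\n'                 > patterns.txt
printf 'some alpha text\nmore gamma text\n'   > input.txt
# grep -o prints each matching pattern occurrence; sort -u dedups;
# grep -vxFf - then keeps only the patterns that never matched
grep -oFf patterns.txt input.txt | sort -u | grep -vxFf - patterns.txt
# -> beta
# The join + sort variant: join -v1 keeps lines of the first (sorted)
# file that have no counterpart in the second
join -v1 <(sort patterns.txt) <(grep -oFf patterns.txt input.txt | sort -u)
# -> beta
```

Both print the patterns with no match; the grep pipeline reads patterns from stdin via -f -, which is a GNU grep feature.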