Best way to remove lines from a file where matching text (not whole line) exists in another file

text processing

I have a file of email addresses (file 1) and another file (file 2) whose lines of data contain some of these email addresses.

I want to compare files and remove all lines in file 2 that have a matching email in file 1.

I know I can loop and use sed -i, or loop and grep each line into a new file and then compare the files with comm.

But I was wondering whether there is a way to match all the lines of file 1 against file 2 in one pass and be left with only the lines from file 2 that do not contain any email from file 1.

file 1:

test@invalid.com
test2@invalid.com
test3@invalid.com

file 2:

23456|tom|jones|test@goodemailcom|10
12345|pete|best|pete@goodemail.com|10
87569|remove3|me3|test3@invalid.com|10
23098|mike|jones|mike@goodemailcom|10
10985|al|best|al@goodemail.com|10
09865|removve|me|test@invalid.com|10
13579|remove2|me2|test2@invalid.com|10

Appreciate any knowledge anyone has.

Best Answer

You can use fgrep for this:

fgrep -v -f file1 file2 > unique_addresses

This relies on file1 having exactly one email address per line, because grep reads one pattern per line from the file given with -f.

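Traditionally fgrep exists as a separate program, but grep -F does the same thing and is the form specified by POSIX. Recent GNU grep releases warn that fgrep is obsolescent, so prefer grep -F in new scripts.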
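As a minimal sketch, here is the same filter written with grep -F and run on a subset of the sample data from the question. The scratch directory and the variable name kept are only for illustration. Note that -F matches substrings, so test@invalid.com would also match a line containing mytest@invalid.com, and an empty line in file1 matches every line of file2.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)

printf '%s\n' 'test@invalid.com' 'test2@invalid.com' 'test3@invalid.com' > "$tmp/file1"
printf '%s\n' \
  '12345|pete|best|pete@goodemail.com|10' \
  '87569|remove3|me3|test3@invalid.com|10' \
  '10985|al|best|al@goodemail.com|10' \
  '09865|removve|me|test@invalid.com|10' > "$tmp/file2"

# -F: treat patterns as fixed strings, -v: print non-matching lines,
# -f: read the patterns from file1, one per line.
kept=$(grep -F -v -f "$tmp/file1" "$tmp/file2")
printf '%s\n' "$kept"

rm -rf "$tmp"
```

Only the pete and al lines remain, because every other line contains one of the addresses listed in file1.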

Related Solutions

How to remove lines included in one file from another file

grep can read multiple patterns from a file, one per line, with -f. Combine it with -v to output non-matching lines, -F to match fixed strings instead of regular expressions, and -x to require that the whole line matches.

grep -Fvx -f partial.list complete.list >remaining.list &&
mv remaining.list complete.list

The second command is only needed if you want to overwrite the file containing the complete list.
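To illustrate what -x changes, the following sketch runs the filter with and without it on made-up data (the fruit names and variable names are hypothetical). Without -x, any line that merely contains a pattern is dropped as well.

```shell
tmp=$(mktemp -d)

printf '%s\n' 'apple' > "$tmp/partial.list"
printf '%s\n' 'apple' 'pineapple' 'banana' > "$tmp/complete.list"

# With -x only lines that equal a pattern exactly are removed.
whole_line=$(grep -Fvx -f "$tmp/partial.list" "$tmp/complete.list")

# Without -x "pineapple" is removed too, because it contains "apple".
substring=$(grep -Fv -f "$tmp/partial.list" "$tmp/complete.list")

printf 'with -x:\n%s\n' "$whole_line"
printf 'without -x:\n%s\n' "$substring"

rm -rf "$tmp"
```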

If the partial list is huge and you don't mind reordering the list (join requires both inputs to be sorted), then join may be faster.
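A rough sketch of the join variant, assuming each line is a single token with no whitespace (join compares only the first whitespace-separated field by default). The addresses and the .sorted file names are made up for the example, and LC_ALL=C is set so that sort and join agree on the ordering.

```shell
tmp=$(mktemp -d)

printf '%s\n' 'c@example.com' 'a@example.com' 'b@example.com' 'd@example.com' > "$tmp/complete.list"
printf '%s\n' 'd@example.com' 'b@example.com' > "$tmp/partial.list"

# join needs both inputs sorted in the same collation order.
LC_ALL=C sort "$tmp/complete.list" > "$tmp/complete.sorted"
LC_ALL=C sort "$tmp/partial.list" > "$tmp/partial.sorted"

# -v 1: print only the lines of the first file that have no match in the second.
remaining=$(LC_ALL=C join -v 1 "$tmp/complete.sorted" "$tmp/partial.sorted")
printf '%s\n' "$remaining"

rm -rf "$tmp"
```

The result comes out in sorted order rather than the original order, which is the reordering mentioned above.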

Shell – compare two columns of different files and print if it matches

This is what awk was designed for:

$ awk -F'|' 'NR==FNR{c[$1$2]++;next};c[$1$2] > 0' file2 file1
abc|123|BNY|apple|
cab|234|cyx|orange|

Explanation

-F'|' : sets the field separator to |.
NR==FNR : NR is the number of lines read so far across all files and FNR is the line number within the current file. The two are equal only while the 1st file is being read.
c[$1$2]++; next : if this is the 1st file, use the concatenation of the 1st two fields as a key in the c array and count it. Then skip to the next line, so that this block is only applied to the 1st file.
c[$1$2]>0 : because of the next above, this pattern is only reached while the second file is being read. It checks whether fields 1 and 2 of the current line were already seen in the 1st file. In awk, the default action for a true pattern is to print the line, so if c[$1$2]>0 is true, the line is printed.
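The walkthrough above can be tried on a small input. The file contents below are hypothetical, chosen only so that the result matches the output shown earlier; they are not the original poster's data.

```shell
tmp=$(mktemp -d)

# file2 supplies the keys (fields 1 and 2); file1 is the file being filtered.
printf '%s\n' 'abc|123|x' 'cab|234|y' > "$tmp/file2"
printf '%s\n' 'abc|123|BNY|apple|' 'xyz|999|QQQ|pear|' 'cab|234|cyx|orange|' > "$tmp/file1"

# First pass (NR==FNR) records the keys from file2; the second pass prints
# the lines of file1 whose first two fields were recorded.
matched=$(awk -F'|' 'NR==FNR{c[$1$2]++;next};c[$1$2] > 0' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$matched"

rm -rf "$tmp"
```

One design note: $1$2 concatenates the fields without a separator, so the pairs ab|c and a|bc produce the same key. Writing the key as c[$1,$2] would keep them apart, if that matters for your data.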

Alternatively, since you tagged the question with Perl:

perl -e 'open(A, "file2"); while(<A>){/.+?\|[^|]+/ && $k{$&}++};
         while(<>){/.+?\|[^|]+/ && do{print if defined($k{$&})}}' file1

Explanation

The first line opens file2, matches everything up to the 2nd | (.+?\|[^|]+) on each line and saves that text as a key in the %k hash ($& holds the text of the last successful match).

The second line processes file1, uses the same regex to extract the 1st two columns and prints the line if those columns are defined in the %k hash.

Both of the above approaches need to hold the first two columns of file2 in memory. That shouldn't be a problem if you only have a few hundred thousand lines, but if it is, you could do something like

cut -d'|' -f 1,2 file2 | while IFS= read -r pat; do grep "^$pat|" file1; done

But that will be slower.
