Shell – compare two columns of different files and print if it matches

perlshell-scripttext processing

I am using Solaris 10 and so grep options involving -f don't work.

I have two pipe-separated files:

file1:

abc|123|BNY|apple|
cab|234|cyx|orange|
def|kumar|pki|bird|

file 2:

abc|123|
kumar|pki|
cab|234

I would like to compare the first two columns of file2 with file1 (search through the entire contents of file1 in first two columns) if they match print the matched line of file1. Then search for the second line of file 2 and so on.

Expected Output:

abc|123|BNY|apple|
cab|234|cyx|orange|

The files I have are huge, containing about 400,000 lines, so I would like to make the execution fast.

Best Answer

This is what awk was designed for:

$ awk -F'|' 'NR==FNR{c[$1$2]++;next};c[$1$2] > 0' file2 file1
abc|123|BNY|apple|
cab|234|cyx|orange|

Explanation

  • -F'|' : sets the field separator to |.
  • NR==FNR : NR is the current input line number and FNR the current file's line number. The two will be equal only while the 1st file is being read.
  • c[$1$2]++; next : if this is the 1st file, save the 1st two fields in the c array. Then, skip to the next line so that this is only applied on the 1st file.

  • c[$1$2]>0 : the else block will only be executed if this is the second file so we check whether fields 1 and 2 of this file have already been seen (c[$1$2]>0) and if they have been, we print the line. In awk, the default action is to print the line so if c[$1$2]>0 is true, the line will be printed.


Alternatively, since you tagged with Perl:

perl -e 'open(A, "file2"); while(<A>){/.+?\|[^|]+/ && $k{$&}++};
         while(<>){/.+?\|[^|]+/ && do{print if defined($k{$&})}}' file1

Explanation

The first line will open file2, read everything up to the 2nd | (.+?\|[^|]+) and save that (the $& is the result of the last match operator) in the %k hash.

The second line processes file1, uses the same regex to extract the 1st two columns and print the line if those columns are defined in the %k hash.


Both of the above approaches will need to hold the 2 first columns of file2 in memory. That shouldn't be a problem if you only have a few hundred thousand lines but if it is, you could do something like

cut -d'|' -f 1,2 file2 | while read pat; do grep "^$pat" file1; done

But that will be slower.

Related Question