Compare two files and print matches – large files

awk, grep, text-processing

I need to compare two files and print the matching lines.
If a username from file1 appears in file2 (field 1), I want to print that line to a new file of matches.

File1.txt:

Hey123
Johnson
Hanny123
Fanny

(file1 is 240MB – 20.000.000 lines)

File2.txt:

Gromy123:hannibal
Hey123:groll
Hanny123:tronda9
Kroppsk:football23

(file2 is 1.4GB – 69.000.000 lines)

Expected matched lines output:

Hanny123:tronda9
Hey123:groll

I have been trying for 4 hours without success. Both files are sorted, and I have tried join plus countless grep / awk commands. My big problem is RAM exhaustion. I would love some help on how to approach this with such large files.

Best Answer

If the files are sorted on the join field (you say they are, although the samples as posted are not quite in order) then it's as simple as

join -t : File1.txt File2.txt

join pairs up lines from two files where the join field is equal. By default, the join field is the first field, the fields are output in order except that the join field is not repeated, and non-pairable lines are skipped, which is exactly what you want.
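As a quick check of that behaviour, here is a minimal sketch using the sample data from the question, already put into sorted order. The scratch directory and the LC_ALL=C setting are assumptions of this example, not part of the original command: the C locale simply makes sort and join agree on plain byte order. sort -c only verifies that a file is sorted, without rewriting it.

```shell
# Illustrative scratch copies of the sample data, pre-sorted in byte order.
tmp=$(mktemp -d)
printf '%s\n' Fanny Hanny123 Hey123 Johnson > "$tmp/File1.txt"
printf '%s\n' Gromy123:hannibal Hanny123:tronda9 Hey123:groll Kroppsk:football23 > "$tmp/File2.txt"

# Make sort and join use the same (byte-wise) collation.
export LC_ALL=C

# sort -c exits non-zero if its input is not already sorted.
if sort -c "$tmp/File1.txt" && sort -c -t : -k 1,1 "$tmp/File2.txt"; then
    sorted=yes
else
    sorted=no
fi
echo "inputs sorted: $sorted"

# Pair lines whose first colon-separated field is equal.
matched=$(join -t : "$tmp/File1.txt" "$tmp/File2.txt")
printf '%s\n' "$matched"

rm -r "$tmp"
```

On the real files, redirecting the join output to a new file (for example > matched.txt) gives the file of matches the question asks for.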

Note that if the files have Windows line endings, they appear under Unix systems to have an extra carriage return (CR) character at the end of each line. The CR is mostly invisible, but as far as join and other text tools are concerned, it's a character like any other. It means the fields of File1.txt all end with a CR whereas the ones in File2.txt don't, so they don't match. You need to strip the CR, at least in File1.txt.

<File1.txt tr -d '\r' | join -t : - File2.txt
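To see whether this applies to your files, you can count the lines that contain a CR before deciding to strip it. The following is a small sketch on made-up two-line inputs (the scratch directory and the variable names are assumptions of this example); it shows join finding nothing while the CRs are present and finding both lines once they are removed.

```shell
# Illustrative inputs: File1.txt has CRLF line endings, File2.txt does not.
tmp=$(mktemp -d)
printf '%s\r\n' Hanny123 Hey123 > "$tmp/File1.txt"
printf '%s\n' Hanny123:tronda9 Hey123:groll > "$tmp/File2.txt"
export LC_ALL=C

# Count lines containing a carriage return; non-zero means stripping is needed.
cr=$(printf '\r')
cr_lines=$(grep -c "$cr" "$tmp/File1.txt")
echo "lines with CR in File1.txt: $cr_lines"

# With the CRs in place, no username in File1.txt equals one in File2.txt.
with_cr=$(join -t : "$tmp/File1.txt" "$tmp/File2.txt")

# After stripping the CRs, the usernames match again.
without_cr=$(tr -d '\r' < "$tmp/File1.txt" | join -t : - "$tmp/File2.txt")
printf '%s\n' "$without_cr"

rm -r "$tmp"
```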

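The files do need to be sorted on the join field. If they aren't, then in ksh/bash/zsh you can use process substitution. (Add tr -d '\r' | if needed.) Sort File2.txt on its first field only: a whole-line sort can disagree with the order join expects when one username is a prefix of another and the next character sorts before the colon.

join -t : <(sort File1.txt) <(sort -t : -k 1,1 File2.txt)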
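Since the concern in the question is RAM, it may help to sort each file once into a new file and then join the sorted copies. The sketch below assumes GNU sort, whose -S option caps the in-memory buffer and whose -T option chooses where temporary spill files go; the 256M value, the scratch directory and the .sorted file names are illustrative only. sort falls back to temporary files when the input does not fit in its buffer, and join reads both inputs sequentially, so neither step needs to hold a whole file in memory. File2.txt is sorted on field 1 only (-t : -k 1,1) so that its order is the one join compares on.

```shell
# Illustrative copies of the sample data, in the unsorted order posted in the question.
tmp=$(mktemp -d)
printf '%s\n' Hey123 Johnson Hanny123 Fanny > "$tmp/File1.txt"
printf '%s\n' Gromy123:hannibal Hey123:groll Hanny123:tronda9 Kroppsk:football23 > "$tmp/File2.txt"

# Same collation for sort and join.
export LC_ALL=C

# -S limits the sort buffer, -T sets the directory for temporary files (GNU sort).
sort -S 256M -T "$tmp" -o "$tmp/File1.sorted" "$tmp/File1.txt"
sort -S 256M -T "$tmp" -t : -k 1,1 -o "$tmp/File2.sorted" "$tmp/File2.txt"

# Join the sorted copies and write the matches to a new file.
join -t : "$tmp/File1.sorted" "$tmp/File2.sorted" > "$tmp/matched.txt"
matched=$(cat "$tmp/matched.txt")
printf '%s\n' "$matched"

rm -r "$tmp"
```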

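In plain sh, if your Unix variant has /dev/fd (most do), you can use that instead to pipe the output of two programs through two file descriptors. Descriptor 3 is duplicated from the standard input of the brace group, which is the output of the first sort.

sort -t : -k 1,1 File2.txt | { sort File1.txt | join -t : /dev/fd/0 /dev/fd/3; } 3<&0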
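Here is a small sketch of the /dev/fd technique on the sample data, assuming a system that provides /dev/fd (as Linux does). Descriptor 3 is duplicated from the brace group's standard input with 3<&0, so join sees the sorted File1.txt on /dev/fd/0 and the sorted File2.txt on /dev/fd/3. The scratch directory is illustrative.

```shell
# Illustrative copies of the sample data, unsorted as posted.
tmp=$(mktemp -d)
printf '%s\n' Hey123 Johnson Hanny123 Fanny > "$tmp/File1.txt"
printf '%s\n' Gromy123:hannibal Hey123:groll Hanny123:tronda9 Kroppsk:football23 > "$tmp/File2.txt"
export LC_ALL=C

matched=$(
    # The outer sort feeds the brace group's stdin, which 3<&0 copies to fd 3.
    sort -t : -k 1,1 "$tmp/File2.txt" |
    { sort "$tmp/File1.txt" | join -t : /dev/fd/0 /dev/fd/3; } 3<&0
)
printf '%s\n' "$matched"

rm -r "$tmp"
```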

If you need to preserve the original order of File1.txt and it isn't sorted by the join field, then add line numbers to remember the original order, sort by the join field, join, sort by line numbers and strip the line numbers. (You can do something similar if you want to preserve the order of the other file.)

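<File1.txt nl -s : |                                # prefix each line with its line number
sort -t : -k 2,2 |                                  # sort by username (now field 2)
join -t : -1 2 - <(sort -t : -k 1,1 File2.txt) |    # output is username:number:rest
sort -t : -k 2,2n |                                 # restore the original File1.txt order
cut -d : -f 1,3-                                    # drop the line number, keep the rest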
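The same number, sort, join, re-sort, strip sequence can be tried on the sample data. This sketch is written for plain sh, so it sorts File2.txt into a scratch file instead of using process substitution; the scratch directory and the File2.sorted name are assumptions of the example. nl -b a numbers every line, including empty ones.

```shell
# Illustrative copies of the sample data, unsorted as posted.
tmp=$(mktemp -d)
printf '%s\n' Hey123 Johnson Hanny123 Fanny > "$tmp/File1.txt"
printf '%s\n' Gromy123:hannibal Hey123:groll Hanny123:tronda9 Kroppsk:football23 > "$tmp/File2.txt"
export LC_ALL=C

# Sort the second file on its join field once.
sort -t : -k 1,1 -o "$tmp/File2.sorted" "$tmp/File2.txt"

matched=$(
    nl -b a -s : < "$tmp/File1.txt" |        # number:username
    sort -t : -k 2,2 |                       # order by username for join
    join -t : -1 2 - "$tmp/File2.sorted" |   # username:number:rest
    sort -t : -k 2,2n |                      # back to File1.txt order
    cut -d : -f 1,3-                         # remove the number field
)
printf '%s\n' "$matched"

rm -r "$tmp"
```

The matches now come out in the order the usernames appear in File1.txt (Hey123 before Hanny123) rather than in sorted order.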

Explanation (of a related awk one-liner that compares the first two |-separated fields of two files)

-F'|' : sets the field separator to |.
NR==FNR : NR is the current input line number and FNR the current file's line number. The two will be equal only while the 1st file is being read.
c[$1$2]++; next : if this is the 1st file, save the 1st two fields in the c array. Then, skip to the next line so that this is only applied on the 1st file.
c[$1$2]>0 : the else block will only be executed if this is the second file so we check whether fields 1 and 2 of this file have already been seen (c[$1$2]>0) and if they have been, we print the line. In awk, the default action is to print the line so if c[$1$2]>0 is true, the line will be printed.
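The command these notes describe is not shown here. Reconstructed from the explanation alone, it would look roughly like the sketch below; treat the exact form as an assumption, along with the made-up sample rows. file2 is read first, which matches the later remark that the first two columns of file2 are held in memory.

```shell
# Hypothetical |-separated sample data for the two files.
tmp=$(mktemp -d)
printf '%s\n' 'a|b|from-file2' 'c|d|from-file2' > "$tmp/file2"
printf '%s\n' 'a|b|1' 'e|f|2' 'c|d|3' > "$tmp/file1"

# First file (file2): remember fields 1 and 2. Second file (file1): print
# the line if its fields 1 and 2 were seen in the first file.
matched=$(awk -F'|' 'NR==FNR { c[$1$2]++; next } c[$1$2] > 0' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$matched"

rm -r "$tmp"
```

Because $1$2 concatenates the fields with nothing in between, rows such as ab|c and a|bc share a key; using c[$1 FS $2] as the key instead would keep them apart.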

Alternatively, since you tagged with Perl:

perl -e 'open(A, "file2"); while(<A>){/.+?\|[^|]+/ && $k{$&}++};
         while(<>){/.+?\|[^|]+/ && do{print if defined($k{$&})}}' file1

Explanation

The first line will open file2, read everything up to the 2nd | (.+?\|[^|]+) and save that (the $& is the result of the last match operator) in the %k hash.

The second line processes file1, uses the same regex to extract the first two columns, and prints the line if those columns are defined in the %k hash.

Both of the above approaches need to hold the first two columns of file2 in memory. That shouldn't be a problem if you only have a few hundred thousand lines, but if it is, you could do something like

cut -d'|' -f 1,2 file2 | while IFS= read -r pat; do grep "^$pat" file1; done

But that will be slower.
