Text Processing – Print Lines from One File if Part Appears in Another

Tags: grep, text processing

I have two files, let's call them 123.txt and 789.txt. 123.txt is 2.5M lines long, and 789.txt is 65M lines long. Is there any way to use grep or similar to keep any lines from 789.txt that contain lines from 123.txt?

There will be at most one match per line in 789.txt, and the matching text will be at the beginning of the line. I'm totally stuck on this and couldn't find any info online, so I don't really have anything to start with. It will be running on a server, so I don't mind it taking a while (which I know it will).

  • 123.txt:

    hxxp://www.a.com
    hxxp://www.b.com
    hxxp://www.c.com
    
  • 789.txt:

    hxxp://www.a.com/kgjdk-jgjg/
    hxxp://www.b.com/gsjahk123/
    hxxp://www.c.com/abc.txt
    hxxp://www.d.com/sahgsj/
    
  • Desired output:

    hxxp://www.a.com/kgjdk-jgjg/
    hxxp://www.b.com/gsjahk123/
    hxxp://www.c.com/abc.txt
    

Best Answer

You can do this very easily using grep:

$ grep -Ff 123.txt 789.txt
hxxp://www.a.com/kgjdk-jgjg/
hxxp://www.b.com/gsjahk123/
hxxp://www.c.com/abc.txt

The command above prints every line of 789.txt that contains any line of 123.txt as a substring. The -f option means "read the patterns to search for from this file", and -F tells grep to treat those patterns as fixed strings rather than as regular expressions, which is its default.
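To illustrate why -F matters here, the following is a minimal sketch using made-up scratch files (patterns.txt and data.txt are hypothetical names, and the second data line is invented for the demonstration). Without -F, the dots in a URL act as regex wildcards and can match lines that do not literally contain the pattern.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)
printf '%s\n' 'hxxp://www.a.com' > "$tmp/patterns.txt"
printf '%s\n' 'hxxp://www.a.com/kgjdk-jgjg/' 'hxxp://wwwxa.com/other/' > "$tmp/data.txt"

# Without -F each pattern is a regular expression, so "." matches any character
# and the invented "wwwxa.com" line is reported as well.
regex_hits=$(grep -f "$tmp/patterns.txt" "$tmp/data.txt")

# With -F each pattern is a literal string, so "." matches only a dot.
fixed_hits=$(grep -Ff "$tmp/patterns.txt" "$tmp/data.txt")

printf 'regex:\n%s\nfixed:\n%s\n' "$regex_hits" "$fixed_hits"
rm -rf "$tmp"
```

Only the first data line should appear under "fixed", while both appear under "regex".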

This will not work if the lines of 123.txt contain trailing spaces. grep treats the spaces as part of the pattern, so the pattern will not match when the rest of it occurs inside a longer word. For example, the pattern "foo " (note the trailing space) will not match foobar. To remove trailing spaces from your file, run this command:

$ sed 's/ *$//' 123.txt > new_file

Then use the new_file to grep:

$ grep -Ff new_file 789.txt
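The effect of a trailing space can be checked on a tiny sample before running against the full files. This is only a sketch: it recreates one line of each file from the question in a scratch directory, with a trailing space deliberately added to the pattern, and uses grep -c to count matching lines.

```shell
tmp=$(mktemp -d)
# One pattern with a deliberate trailing space, and one matching data line.
printf '%s\n' 'hxxp://www.a.com ' > "$tmp/123.txt"
printf '%s\n' 'hxxp://www.a.com/kgjdk-jgjg/' > "$tmp/789.txt"

# grep exits non-zero when nothing matches, hence the "|| true".
before=$(grep -cFf "$tmp/123.txt" "$tmp/789.txt" || true)

# Strip trailing spaces into a new pattern file, then count again.
sed 's/ *$//' "$tmp/123.txt" > "$tmp/new_file"
after=$(grep -cFf "$tmp/new_file" "$tmp/789.txt" || true)

echo "matches before: $before, after: $after"
rm -rf "$tmp"
```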

You can also edit the file in place instead of creating a new one, using sed's -i flag:

$ sed -i.bak 's/ *$//' 123.txt

This will change file 123.txt and keep a copy of the original called 123.txt.bak.

(Note that -i is not part of POSIX sed. The attached form -i.bak shown here works with both GNU sed and BSD sed; BSD sed also accepts -i .bak with a space in between, which GNU sed does not.)
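Since the question says the matching text always sits at the start of each line in 789.txt, one possible alternative to grep -F is an exact prefix lookup in awk. This is a different technique from the answer above, not a replacement for it. It is a sketch that assumes every line of 123.txt has exactly the form scheme://host with no path and no trailing spaces; the sample data below is copied from the question.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'hxxp://www.a.com' 'hxxp://www.b.com' 'hxxp://www.c.com' > "$tmp/123.txt"
printf '%s\n' \
  'hxxp://www.a.com/kgjdk-jgjg/' \
  'hxxp://www.b.com/gsjahk123/' \
  'hxxp://www.c.com/abc.txt' \
  'hxxp://www.d.com/sahgsj/' > "$tmp/789.txt"

# First file: remember every line as a hash key.
# Second file: rebuild "scheme://host" from the first and third
# "/"-separated fields and print the line only if that prefix was seen.
prefix_hits=$(awk -F/ '
  NR == FNR { seen[$0]; next }
  ($1 "//" $3) in seen
' "$tmp/123.txt" "$tmp/789.txt")

printf '%s\n' "$prefix_hits"
rm -rf "$tmp"
```

Unlike grep -F, this only matches at the start of the line, and it holds all of 123.txt in memory as hash keys, so whether it is faster or lighter on 2.5M patterns would need to be measured.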
