Text Processing – Print Lines from One File if Part Appears in Another

grep, text-processing

I have two files, let's call them 123.txt and 789.txt. 123.txt is 2.5M lines long, and 789.txt is 65M lines long. Is there any way to use grep or similar to keep any lines from 789.txt that contain lines from 123.txt?

There will be at most one match per line in 789.txt, and the matching text will be at the beginning of the line. I'm totally stuck on this and couldn't find any info online, so I don't really have anything to start with. It will be running on a server, so I don't mind it taking a while (which I know it will).

123.txt:

hxxp://www.a.com
hxxp://www.b.com
hxxp://www.c.com

789.txt:

hxxp://www.a.com/kgjdk-jgjg/
hxxp://www.b.com/gsjahk123/
hxxp://www.c.com/abc.txt
hxxp://www.d.com/sahgsj/

Desired output:

hxxp://www.a.com/kgjdk-jgjg/
hxxp://www.b.com/gsjahk123/
hxxp://www.c.com/abc.txt

Best Answer

You can do this very easily using grep:

$ grep -Ff 123.txt 789.txt
hxxp://www.a.com/kgjdk-jgjg/
hxxp://www.b.com/gsjahk123/
hxxp://www.c.com/abc.txt

The command above will print all lines from file 789.txt that contain any of the lines from 123.txt. The -f means "read the patterns to search for from this file", and the -F tells grep to treat the patterns as fixed strings rather than as regular expressions, which is its default.
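As a minimal sketch of the command above, the following script rebuilds the question's sample data in a scratch directory and runs the same grep. The scratch directory and the variable names (workdir, matches) are illustrative assumptions, not part of the original answer.

```shell
#!/bin/sh
# Throwaway directory for the sample files (assumed location, not from the question).
workdir=$(mktemp -d)

# Sample data copied from the question.
printf '%s\n' 'hxxp://www.a.com' 'hxxp://www.b.com' 'hxxp://www.c.com' \
    > "$workdir/123.txt"
printf '%s\n' \
    'hxxp://www.a.com/kgjdk-jgjg/' \
    'hxxp://www.b.com/gsjahk123/' \
    'hxxp://www.c.com/abc.txt' \
    'hxxp://www.d.com/sahgsj/' > "$workdir/789.txt"

# -F: treat patterns as fixed strings; -f: read one pattern per line from 123.txt.
matches=$(grep -Ff "$workdir/123.txt" "$workdir/789.txt")
printf '%s\n' "$matches"
```

Because -F disables regular-expression syntax, the dots in the URLs match only literal dots, which is what you want for this kind of data.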

This will not work if the lines of 123.txt contain trailing spaces: grep treats the spaces as part of the pattern, so the pattern will not match where the text is followed by anything other than a space, for instance inside a longer word. For example, the pattern "foo " (note the trailing space) will not match foobar. To remove trailing spaces from your file, run this command:

$ sed 's/ *$//' 123.txt > new_file

Then use the new_file to grep:

$ grep -Ff new_file 789.txt
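To see the trailing-space problem in isolation, here is a small sketch that uses the foo/foobar example from above. The file names patterns.txt, patterns.clean and data.txt are made up for the illustration.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# One pattern with a trailing space, and one data line that lacks the space.
printf 'foo \n' > "$workdir/patterns.txt"
printf 'foobar\n' > "$workdir/data.txt"

# grep -c prints the number of matching lines; "|| true" keeps the script
# going when grep exits non-zero because nothing matched.
before=$(grep -cFf "$workdir/patterns.txt" "$workdir/data.txt" || true)

# Strip trailing spaces from every pattern, then search again.
sed 's/ *$//' "$workdir/patterns.txt" > "$workdir/patterns.clean"
after=$(grep -cFf "$workdir/patterns.clean" "$workdir/data.txt" || true)

echo "matching lines before cleanup: $before, after cleanup: $after"
```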

You can also do this without a new file, using sed's -i flag to edit the file in place:

$ sed -i.bak 's/ *$//' 123.txt

This will change file 123.txt and keep a copy of the original called 123.txt.bak.

(Note that the -i flag is not part of POSIX sed. The attached form -i.bak works with both GNU sed and BSD sed; BSD sed also accepts -i .bak with a space in between, which GNU sed does not.)
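A quick way to check what the in-place edit does before running it on the real file is to try it on a scratch copy. This is only a sketch: the scratch directory and the one-line sample file are assumptions, and it requires a sed that supports -i.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# A one-line stand-in for 123.txt with two trailing spaces.
printf 'hxxp://www.a.com  \n' > "$workdir/123.txt"

# Edit in place, keeping the untouched original as 123.txt.bak.
sed -i.bak 's/ *$//' "$workdir/123.txt"

cleaned=$(cat "$workdir/123.txt")
printf '[%s]\n' "$cleaned"     # brackets make any leftover spaces visible
ls "$workdir"
```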

Related Solutions

Grep – How to Identify Patterns That Aren’t Matched

With GNU grep the following should work. Using the -f option, pass file1.txt as a "pattern file", but also pass it in a second time as a data file. Use -o to report only the matching parts and -h to suppress the file names. Finally, sort the results and use uniq -u to keep only those words that appear exactly once; these correspond to the lines from file1.txt that do not find a match in file2.txt.

grep -h -o -f file1.txt file2.txt file1.txt | sort | uniq -u
ijkl
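The contents of file1.txt and file2.txt are not shown above, so the following sketch uses made-up sample data (three patterns, two of which occur in file2.txt) to show how the pipeline arrives at a single unmatched pattern.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical sample data: "ijkl" is the only pattern absent from file2.txt.
printf '%s\n' abcd efgh ijkl > "$workdir/file1.txt"
printf '%s\n' 'xx abcd yy' 'efgh zz' > "$workdir/file2.txt"

# Every pattern matches itself once in file1.txt; patterns that also occur
# in file2.txt show up more than once, so uniq -u drops them.
unmatched=$(grep -h -o -f "$workdir/file1.txt" \
    "$workdir/file2.txt" "$workdir/file1.txt" | sort | uniq -u)
printf '%s\n' "$unmatched"
```

Note that -o prints the matched text rather than the pattern, so the trick is most reliable when the patterns are literal words rather than general regular expressions.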

Logs – Filtering Multi-lines from a Log

There is no need to combine several tools. The task can be done with sed alone:

sed '/^INFO\|^DEBUG\|^TRACE\|^ERROR/{
    /Logger2/{
        :1
        N
        /\nINFO\|\nDEBUG\|\nTRACE\|\nERROR/!s/\n//
        $!t1
        D
    }
}' log.entry
