I have two files, let's call them 123.txt and 789.txt. 123.txt is 2.5M lines long, and 789.txt is 65M lines long. Is there any way to use grep or similar to keep any lines from 789.txt that contain lines from 123.txt?
There will be a max of one duplicate per line in 789.txt, and the duplicate text will be at the beginning of the line. I'm totally stuck on this, and couldn't find any info online, so I don't really have anything to start with. It will be running on a server, so I don't mind it taking a while (which I know it will).
-
123.txt:
hxxp://www.a.com
hxxp://www.b.com
hxxp://www.c.com
-
789.txt:
hxxp://www.a.com/kgjdk-jgjg/
hxxp://www.b.com/gsjahk123/
hxxp://www.c.com/abc.txt
hxxp://www.d.com/sahgsj/
-
Desired output:
hxxp://www.a.com/kgjdk-jgjg/
hxxp://www.b.com/gsjahk123/
hxxp://www.c.com/abc.txt
Best Answer
You can do this very easily using grep with the -F and -f options, giving 123.txt as the pattern file and 789.txt as the file to search. This prints all lines from 789.txt that contain any of the lines from 123.txt. The -f means "read the patterns to search from this file" and the -F tells grep to treat the search patterns as fixed strings rather than its default regular expressions.

This will not work if the lines of 123.txt contain trailing spaces: grep will treat the spaces as part of the pattern to look for, and the pattern will not match if it occurs within a word. For example, the pattern "foo " (note the trailing space) will not match foobar. To fix this, use sed to remove the trailing spaces from 123.txt, write the result to a new file (new_file), and then pass new_file to grep as the pattern file in place of 123.txt.

You can also do this without a new file, using sed's -i flag with a .bak suffix (-i.bak). This changes 123.txt in place and keeps a copy of the original called 123.txt.bak.

(Note that this form of the -i flag to sed assumes you have GNU sed; for BSD sed use -i .bak with a space in between.)
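
The steps described above can be sketched roughly as follows. This is a minimal demonstration on the sample data from the question, not necessarily the exact commands the answer had in mind: the scratch directory, the deliberately added trailing space, the matches.txt output file, and the sed expression 's/ *$//' (which strips trailing spaces only, not tabs) are all assumptions made for illustration.

```shell
# Work in a scratch directory so no real files are touched (illustrative only).
cd "$(mktemp -d)"

# Sample data from the question; the first pattern has a trailing space on purpose.
printf '%s\n' 'hxxp://www.a.com ' 'hxxp://www.b.com' 'hxxp://www.c.com' > 123.txt
printf '%s\n' 'hxxp://www.a.com/kgjdk-jgjg/' 'hxxp://www.b.com/gsjahk123/' \
              'hxxp://www.c.com/abc.txt' 'hxxp://www.d.com/sahgsj/' > 789.txt

# -F: treat patterns as fixed strings; -f: read the patterns from 123.txt.
# The a.com line is missed here because its pattern ends in a space.
grep -Ff 123.txt 789.txt

# Strip trailing spaces into new_file, then use new_file as the pattern file.
sed 's/ *$//' 123.txt > new_file
grep -Ff new_file 789.txt > matches.txt
cat matches.txt

# Alternative: edit 123.txt in place, keeping the original as 123.txt.bak.
sed -i.bak 's/ *$//' 123.txt
grep -Ff 123.txt 789.txt
```

On the real files you would run only the grep and sed lines, redirecting grep's output to a file of your choice. With millions of fixed-string patterns the run may need a fair amount of memory and time.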