Extract lines that match a list of words in another file

awk bioinformatics grep sed

I have file 1, which contains these lines:

ATM 1434.972183
BMPR2 10762.78192
BMPR2 10762.78192
BMPR2 1469.14535
BMPR2 1469.14535
BMPR2 1738.479639
BMS1 4907.841667
BMS1 4907.841667
BMS1 880.4532628
BMS1 880.4532628
BMS1P17 1249.75
BMS1P17 1249.75
BMS1P17 1606.821429
BMS1P17 1606.821429
BMS1P17 1666.333333
BMS1P17 1666.333333
BMS1P17 2108.460317
BMS1P17 2108

And file 2 has a list of words:

ATM
BMS1

So the output should look like this:

ATM 1434.972183
BMS1 4907.841667
BMS1 4907.841667
BMS1 880.4532628
BMS1 880.4532628

I know this is really a duplicate question, but I have tried all kinds of grep, sed and awk commands. Maybe they will work for you on this tiny example, but my real file is huge (more than 1M lines) and none of the previous approaches helped.

They return only some of the lines containing those words, even though other words in file 2 also match lines in file 1.

Best Answer

grep -Fw -f words myfile

This extracts the lines in myfile that contain any of the words listed in the file words, anywhere on the line.

The strings in words are treated as fixed strings (not regular expressions) due to the -F option, and the -w option ensures that we only get lines that contain the exact same word (a string from words that matches only part of a longer word is not accepted). A word is a consecutive sequence of alphanumeric characters and underscores.

The words in the file words must be listed on separate lines.
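
As a minimal sketch of the command above, the following builds two small files from the question's sample data and runs it. The temporary directory and the variable names are only for illustration. It also shows a possible awk alternative, not part of the original answer, that accepts a line only when its first field is exactly one of the listed words, which may be preferable if a word could also appear in a later column.

```shell
# Illustrative sample data, a subset of the question's example
tmp=$(mktemp -d)
printf '%s\n' 'ATM 1434.972183' 'BMPR2 10762.78192' \
    'BMS1 4907.841667' 'BMS1 880.4532628' 'BMS1P17 1249.75' > "$tmp/myfile"
printf '%s\n' 'ATM' 'BMS1' > "$tmp/words"

# Fixed-string, whole-word match anywhere on the line
grep_out=$(grep -Fw -f "$tmp/words" "$tmp/myfile")

# Alternative sketch: exact match on the first field only.
# While reading the first file (NR==FNR), remember each word;
# for the second file, print lines whose first field was remembered.
awk_out=$(awk 'NR==FNR { want[$1]; next } $1 in want' "$tmp/words" "$tmp/myfile")

printf '%s\n' "$grep_out"
rm -rf "$tmp"
```

On this sample both commands select the ATM and BMS1 lines and skip BMS1P17, since BMS1 is followed there by another word character.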

Related Solutions

How to Use Grep/Awk/Unix to Match All Lines from One File in Another File

Via awk

awk 'NR==FNR{A[$4]=$0;next}{print A[$1]}' file2.txt file1.txt

Or sorted output via join:

join -o 2.1 2.2 2.3 2.4 -2 4 <(sort file1.txt) <(sort -k4 file2.txt)

Extract Lines from Bottom Until Regex Match – Using AWK or SED

This feels a bit silly, but:

$ tac file.txt |sed -e '/^virt-top/q' |tac
virt-top time  11:25:17 Host foo.example.com x86_64 32/32CPU 1200MHz 65501MB
   ID S RDRQ WRRQ RXBY TXBY %CPU %MEM   TIME    NAME
    1 R    0    0    0    0  0.6 12.0  96:02:53 instance-0000036f
    2 R    0    0    0    0  0.2 12.0  95:44:08 instance-00000372

GNU tac reverses the file (many non-GNU systems have tail -r instead), and sed prints lines up to and including the first one that starts with virt-top, then quits. You can add sed 1,2d or tail -n +3 to remove the headers.
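
A small sketch of the header-stripping variant, assuming GNU tac is available. The log contents here are made up for illustration and are much shorter than real virt-top output.

```shell
# Hypothetical log with two virt-top blocks; only the last one is wanted
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
virt-top time  11:20:17 Host foo.example.com
   ID S NAME
    1 R old-instance
virt-top time  11:25:17 Host foo.example.com
   ID S NAME
    1 R instance-0000036f
    2 R instance-00000372
EOF

# Reverse the file, stop at the first line starting with virt-top,
# reverse back, then drop the two header lines of that block
last_block=$(tac "$tmp" | sed -e '/^virt-top/q' | tac | tail -n +3)
printf '%s\n' "$last_block"
rm -f "$tmp"
```

Only the two instance rows of the final block remain; the earlier block and both header lines are dropped.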

Or in awk:

$ awk '/^virt-top/ { a = "" } { a = a $0 ORS } END {printf "%s", a}' file.txt 
virt-top time  11:25:17 Host foo.example.com x86_64 32/32CPU 1200MHz 65501MB
   ID S RDRQ WRRQ RXBY TXBY %CPU %MEM   TIME    NAME
    1 R    0    0    0    0  0.6 12.0  96:02:53 instance-0000036f
    2 R    0    0    0    0  0.2 12.0  95:44:08 instance-00000372

It just collects all the lines to a variable, and clears that variable on a line starting with virt-top.

If the file is very large, the tac+sed solution is bound to be faster since it only needs to read the tail end of the file while the awk solution reads the full file from the top.
