AWK/GREP – Print Every Line with Match if Column Matches Another File


I’m taking two input files, one with certain ID numbers, and another with a large list of ID numbers and additional columns. The latter file contains multiple lines for each ID number and I need to extract all lines that match an ID from the first file. Those lines then must be printed in a new file.

Edit 1: Replaced sample files with excerpts from the actual files.

Edit 2: Removed extra spaces that were in excerpt, but not actual file. Files likely need to be sanitized in some way, but how is unclear.

file1:

AT1G56430
AT3G55190
AT3G22880

file2:

AT1G01010|GO:0043090|RCA
AT1G56430|GO:0010233|IGI 
AT1G56430|GO:0009555|IGI 
AT1G56430|GO:0030418|IGI

expected output:

AT1G56430|GO:0010233|IGI 
AT1G56430|GO:0009555|IGI 
AT1G56430|GO:0030418|IGI


I have tried:

awk -F'|' 'NR==FNR{c[$1$2]++;next};c[$1$2] > 0' file1 file2 > output.txt

and:

grep -Ff file2 file1 > output.txt

I’m aware that there are many somewhat similar questions posted in these forums and others. However, these don’t mention how the output is handled… nor do they mention duplicates. I’ve tried solutions from 4 of them, have been messing with this for many hours and keep getting the same problem: a blank output file.

I’m new to awk and I greatly appreciate the help. Sorry if this is a simple problem with syntax etc; please let me know. Thanks for the help.

Best Answer

Your AWK script is nearly there:

awk -F'|' 'NR==FNR{c[$1]++;next};c[$1] > 0' file1 file2 > output.txt

works, after changing the line-endings from Mac to Unix:

tr '\r' '\n' < file1 > file1.new
mv file1.new file1
tr '\r' '\n' < file2 > file2.new
mv file2.new file2
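
Before converting, it may be worth confirming that carriage returns really are the problem. The following is only a sketch under assumptions: the scratch file demo_ids.txt and the variable names are hypothetical, and the file is built with classic Mac (CR-only) line endings to imitate the situation described above. It counts the CR bytes before and after the same tr conversion.

```shell
#!/bin/sh
# Hypothetical scratch file with CR-only line endings (not the asker's real data).
workdir=$(mktemp -d)
printf 'AT1G56430\rAT3G55190\rAT3G22880\r' > "$workdir/demo_ids.txt"

# tr -cd '\r' deletes everything except CR bytes; wc -c counts what is left.
cr_before=$(tr -cd '\r' < "$workdir/demo_ids.txt" | wc -c | tr -d ' ')

# Same conversion as in the answer: turn every CR into a newline.
tr '\r' '\n' < "$workdir/demo_ids.txt" > "$workdir/demo_ids.new"
mv "$workdir/demo_ids.new" "$workdir/demo_ids.txt"

cr_after=$(tr -cd '\r' < "$workdir/demo_ids.txt" | wc -c | tr -d ' ')
line_count=$(wc -l < "$workdir/demo_ids.txt" | tr -d ' ')

echo "CRs before: $cr_before, CRs after: $cr_after, lines: $line_count"
```

A non-zero count before the conversion suggests that AWK was seeing the whole file as a single line (or IDs with a trailing CR), which would explain the empty output.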

$1 is the first field in AWK.

Instead of c[$1] > 0, you can write c[$1]. The > 0 isn't needed: any non-zero value works, so we might as well use the contents of c directly:

awk -F'|' 'NR==FNR{c[$1]++;next};c[$1]' file1 file2 > output.txt
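
To see the command in action, here is a small self-contained sketch that recreates the sample data from the question in a temporary directory (the workdir variable is just an assumption for the demo) and runs the same AWK one-liner on it.

```shell
#!/bin/sh
# Recreate the question's sample files with Unix line endings.
workdir=$(mktemp -d)
printf '%s\n' AT1G56430 AT3G55190 AT3G22880 > "$workdir/file1"
printf '%s\n' \
    'AT1G01010|GO:0043090|RCA' \
    'AT1G56430|GO:0010233|IGI' \
    'AT1G56430|GO:0009555|IGI' \
    'AT1G56430|GO:0030418|IGI' > "$workdir/file2"

# While reading file1 (NR==FNR is only true for the first file), remember each ID.
# While reading file2, print every line whose first |-separated field was remembered.
awk -F'|' 'NR==FNR{c[$1]++;next};c[$1]' "$workdir/file1" "$workdir/file2" > "$workdir/output.txt"

cat "$workdir/output.txt"
```

With this data, only the three AT1G56430 lines should end up in output.txt, because AT1G01010 is not listed in file1.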

Related Solutions

Awk Join – How to Join Two Files with Matching Columns

join works great:

$ join <(sort File1.txt) <(sort File2.txt) | column -t | tac
 id                           No       P   R   S
 gi|371443198|gb|JH556662.1|  7573913  2   2   0
 gi|371440577|gb|JH559283.1|  6931777  21  19  2

P.S. Does the output column order matter?

If yes, use:

$ join <(sort 1) <(sort 2) | tac | awk '{print $1,$3,$4,$5,$2}' | column -t
 id                           P   R   S  No
 gi|371443198|gb|JH556662.1|  2   2   0  7573913
 gi|371440577|gb|JH559283.1|  21  19  2  6931777

Compare two files and print matches – large files

If the files are sorted (the samples you posted are) then it's as simple as

join -t : File1.txt File2.txt

join pairs up lines from two files where the join field is equal. By default, the join field is the first field, the fields are output in order except that the join field is not repeated, and non-pairable lines are skipped, which is exactly what you want.
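
As a rough illustration of that behaviour, here is a sketch with made-up, already sorted, colon-separated data. The file contents are hypothetical and are not the files from the original question.

```shell
#!/bin/sh
# Hypothetical sample data; both files are sorted on the first field.
workdir=$(mktemp -d)
printf '%s\n' 'a:1' 'b:2' 'c:3' > "$workdir/File1.txt"
printf '%s\n' 'a:x' 'c:y' 'd:z' > "$workdir/File2.txt"

# Lines are paired on field 1; "b" and "d" have no partner and are skipped.
joined=$(join -t : "$workdir/File1.txt" "$workdir/File2.txt")
printf '%s\n' "$joined"
# prints:
# a:1:x
# c:3:y
```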

Note that if the files have Windows line endings, they appear under Unix systems to have an extra carriage return (CR) character at the end of each line. The CR is mostly invisible, but as far as join and other text tools are concerned, it's a character like any other. It means the fields of File1.txt all end with a CR whereas the ones in File2.txt don't, so they don't match. You need to strip the CR, at least in File1.txt.

<File1.txt tr -d '\r' | join -t : - File2.txt

You do need to sort the files. If they aren't sorted, then in ksh/bash/zsh you can use process substitution. (Add tr -d '\r' | if needed.)

join -t : <(sort File1.txt) <(sort File2.txt)

In plain sh, if your Unix variant has /dev/fd (most do), you can use that instead to pipe the output of two programs through two file descriptors.

sort File2.txt | { sort File1.txt | join -t : /dev/fd/0 /dev/fd/3; } 3<&0

If you need to preserve the original order of File1.txt and it isn't sorted by the join field, then add line numbers to remember the original order, sort by the join field, join, sort by line numbers and strip the line numbers. (You can do something similar if you want to preserve the order of the other file.)

<File1.txt nl -s : |
sort -t : -k 2 |
join -t : -1 2 - <(sort File2.txt) |
sort -t : -k 2,2n |
cut -d : -f 1,3
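
To make the steps easier to follow, here is a sketch of the same idea on made-up data. Two things are assumptions for the demo: the file contents are hypothetical (File1.txt holds one key per line), and the sorted copy of File2.txt goes into a temporary file instead of a process substitution so that the snippet also runs in plain sh.

```shell
#!/bin/sh
# Hypothetical data: File1.txt is NOT sorted, and its order should be preserved.
workdir=$(mktemp -d)
printf '%s\n' 'c' 'a' 'b' > "$workdir/File1.txt"
printf '%s\n' 'b:y' 'a:x' 'c:z' > "$workdir/File2.txt"

# Sorted copy of the second file (stands in for <(sort File2.txt)).
sort "$workdir/File2.txt" > "$workdir/File2.sorted"

ordered=$(
    nl -s : "$workdir/File1.txt" |                # prefix each line with "N:"
    sort -t : -k 2,2 |                            # sort by the join field
    join -t : -1 2 - "$workdir/File2.sorted" |    # output is key:N:value
    sort -t : -k 2,2n |                           # restore the original order
    cut -d : -f 1,3                               # drop the line number
)
printf '%s\n' "$ordered"
# prints:
# c:z
# a:x
# b:y
```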
