Awk Join – How to Join Two Files with Matching Columns

awkjoin;

File1.txt

    id                            No
    gi|371443199|gb|JH556661.1| 7907290
    gi|371443198|gb|JH556662.1| 7573913
    gi|371443197|gb|JH556663.1| 7384412
    gi|371440577|gb|JH559283.1| 6931777

File2.txt

 id                              P       R       S
 gi|367088741|gb|AGAJ01056324.1| 5       5       0
 gi|371443198|gb|JH556662.1|     2       2       0
 gi|367090281|gb|AGAJ01054784.1| 4       4       0
 gi|371440577|gb|JH559283.1|     21      19      2

output.txt

 id                              P       R       S  NO
 gi|371443198|gb|JH556662.1|     2       2       0  7573913
 gi|371440577|gb|JH559283.1|     21      19      2  6931777

File1.txt has two columns & File2.txt has four columns. I want to join both files which has unique id (array[1] should match in both files (file1.txt & file2.txt)
and give ouput only matched id (see output.txt).

I have tried join -v <(sort file1.txt) <(sort file2.txt). Any help with awk or join commands requested.

Best Answer

join works great:

$ join <(sort File1.txt) <(sort File2.txt) | column -t | tac
 id                           No       P   R   S
 gi|371443198|gb|JH556662.1|  7573913  2   2   0
 gi|371440577|gb|JH559283.1|  6931777  21  19  2

ps. does ouput column order matter?

if yes use:

$ join <(sort 1) <(sort 2) | tac | awk '{print $1,$3,$4,$5,$2}' | column -t
 id                           P   R   S  No
 gi|371443198|gb|JH556662.1|  2   2   0  7573913
 gi|371440577|gb|JH559283.1|  21  19  2  6931777

Related Solutions

Matching Five Columns in two Files using Awk

awk '
    {
        key = $1 SUBSEP $2 SUBSEP $4
    }
    # here, we are reading file1
    NR == FNR {
        f1_line[key] = $0 
        next
    }
    # here, we are reading file2
    key in f1_line && ($5 == "." || $5 == $4) {
        print f1_line[key], $0
    }
' file1 file2

outputs

s2/80   20      .       A       T       86      F=5;U=4 s2/80   20      .       A       A       20      F=5;U=4
s2/20   10      .       G       T       90      F=5;U=4 s2/20   10      .       G       .       99      F=5;U=4

Compare two files and print matches – large files

If the files are sorted (the samples you posted are) then it's as simple as

join -t : File1.txt File2.txt

join pairs up lines from two files where the join field is equal. By default, the join field is the first field, the fields are output in order except that the join field is not repeated, and non-pairable lines are skipped, which is exactly what you want.

Note that if the files have Windows line endings, they appear under Unix systems to have an extra carriage return character at the end of each line. The CR is mostly visually invisible, but as far as join and other text tools are concerned, it's a character like any one else, and it means the fields of File1.txt all end with a CR whereas the ones in File2.txt don't so they don't match. You need to strip the CR, at least in File1.txt.

<File1.txt tr -d '\r' | join -t : - File2.txt

You do need to sort the files. If they aren't, then ksh/bash/zsh, you can use process substitutions. (Add tr -d '\r' | if needed.)

join -t : <(sort File1.txt) <(sort File2.txt)

In plain sh, if your Unix variant has /dev/fd (most do), you can use that instead to pipe the output of two programs through two file descriptors.

sort File2.txt | { sort File1.txt | join -t : /dev/fd/0 /dev/fd/3; } 3<&1

If you need to preserve the original order of File1.txt and it isn't sorted by the join field, then add line numbers to remember the original order, sort by the join field, join, sort by line numbers and strip the line numbers. (You can do something similar if you want to preserver the order of the other file.)

<File1.txt nl -s : |
sort -t : -k 2 |
join -t : -1 2 - <(sort File2.txt) |
sort -t : -k 2,2n |
cut -d : -f 1,3

Best Answer

Related Solutions

Matching Five Columns in two Files using Awk

Compare two files and print matches – large files

Related Question