text-processing – How to Merge Two Files Based on Matching Columns

awk, bioinformatics, join, text processing

I have a file1 like this:

0   AFFX-SNP-000541  NA
0   AFFX-SNP-002255  NA
1   rs12103          0.6401
1   rs12103_1247494  0.696
1   rs12142199       0.7672

And a file2:

0   AFFX-SNP-000541   1
0   AFFX-SNP-002255   1
1   rs12103           0.5596
1   rs12103_1247494   0.5581
1   rs12142199        0.4931

And I would like a file3 like this:

0   AFFX-SNP-000541     NA       1
0   AFFX-SNP-002255     NA       1
1   rs12103             0.6401   0.5596
1   rs12103_1247494     0.696    0.5581
1   rs12142199          0.7672   0.4931

That is, append the 3rd column of file2 to each line of file1, matching lines by the name in the 2nd column.

Best Answer

This should do it:

join -j 2 -o 1.1,1.2,1.3,2.3 file1 file2

Important: this assumes both files are sorted on the SNP name (the 2nd field), as they are in your example. If they are not, sort them on that field first:

join -j 2 -o 1.1,1.2,1.3,2.3 <(sort -k2,2 file1) <(sort -k2,2 file2)

Output:

0 AFFX-SNP-000541 NA 1
0 AFFX-SNP-002255 NA 1
1 rs12103 0.6401 0.5596
1 rs12103_1247494 0.696 0.5581
1 rs12142199 0.7672 0.4931

Explanation (from `info join`):

`join' writes to standard output a line for each pair of input lines that have identical join fields.

`-1 FIELD'
     Join on field FIELD (a positive integer) of file 1.

`-2 FIELD'
     Join on field FIELD (a positive integer) of file 2.

`-j FIELD'
     Equivalent to `-1 FIELD -2 FIELD'.

`-o FIELD-LIST'

 Otherwise, construct each output line according to the format in
 FIELD-LIST.  Each element in FIELD-LIST is either the single
 character `0' or has the form M.N where the file number, M, is `1'
 or `2' and N is a positive field number.

So the command above joins the files on the 2nd field and prints the 1st, 2nd and 3rd fields of file1, followed by the 3rd field of file2.
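Since the question is also tagged awk, here is a minimal sketch of the same merge without sorting, using the two-file idiom explained further down. It is not part of the original answer. It assumes each SNP name appears only once in file2 and that file2 fits in memory; the array name `val`, the variable `merged` and the scratch directory are illustrative only. The sample data is copied from the question, with file2 deliberately shuffled.

```shell
# Recreate the sample data from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '0 AFFX-SNP-000541 NA' '0 AFFX-SNP-002255 NA' \
    '1 rs12103 0.6401' '1 rs12103_1247494 0.696' \
    '1 rs12142199 0.7672' > "$tmp/file1"
# file2 is shuffled on purpose: this approach does not need sorted input.
printf '%s\n' '1 rs12142199 0.4931' '0 AFFX-SNP-000541 1' \
    '1 rs12103 0.5596' '0 AFFX-SNP-002255 1' \
    '1 rs12103_1247494 0.5581' > "$tmp/file2"

# First file read (file2): remember column 3 under the SNP name in column 2.
# Second file read (file1): print its three columns plus the remembered value.
merged=$(awk 'NR==FNR {val[$2] = $3; next}
              $2 in val {print $1, $2, $3, val[$2]}' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$merged"
rm -r "$tmp"
```

The output keeps the line order of file1, and lines of file1 whose SNP is missing from file2 are dropped, which mirrors what join does by default.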

Related Solutions

Shell – How to Merge Two Files with Different Number of Rows

With help from this answer

awk 'FNR==NR && FNR>1 {a[$2] = $5; next}
     FNR > 1 && ($2 in a) && $3 == "ALL" {
         print $1 "    " $2 "    "  a[$2] "    "  $9
     }' file2 file1

To get the header as well, just add this to the beginning of the script:

 BEGIN{print "CHR SNP MAF P"}

Explanation:

First of all, when two files are passed to awk, they are processed one after the other. Two variables are important here: NR is the number of lines read since awk started, and FNR is the line number within the current file. While the first file (here file2) is being processed, NR and FNR have the same value, namely the number of the line currently being processed. When awk moves on to the second file, FNR is reset to 1 but NR keeps counting, so the two are no longer equal. The test FNR==NR is therefore a trick for telling whether the current file is the first one.
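To see the two counters side by side, here is a small sketch on two throwaway files (the file names `first` and `second` and the labels are made up for the demonstration):

```shell
tmp=$(mktemp -d)
printf '%s\n' a b   > "$tmp/first"
printf '%s\n' c d e > "$tmp/second"

# Print NR, FNR, the line itself, and what the NR==FNR test concludes.
counters=$(awk '{print NR, FNR, $0, (NR==FNR ? "first-file" : "later-file")}' \
    "$tmp/first" "$tmp/second")
printf '%s\n' "$counters"
rm -r "$tmp"
```

This prints:

```
1 1 a first-file
2 2 b first-file
3 1 c later-file
4 2 d later-file
5 3 e later-file
```

One caveat: if the first file is empty, NR==FNR is also true while the second file is read, so the trick relies on the first file having at least one line.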

Now for the code. The condition FNR==NR && FNR>1 tests whether we are in the first file and past its first line (the header). If so, we store the value of the fifth column (MAF) in an array indexed by the second column (SNP), and the next statement tells awk to skip the remaining rules and move on to the next line.

When awk processes the second file (which is file1), the first test is false, so awk tries the second one: FNR > 1 && ($2 in a) && $3 == "ALL". That is: this is not the first line of the file, the second column value (SNP) exists as a key in array a, and the third column value (TEST) is "ALL". If all three hold, it prints columns 1 (CHR) and 2 (SNP), then the MAF value looked up with a[$2], and finally column 9 (P).

Adding a BEGIN{...} block at the beginning adds a command that runs once, before the first line of input is read.
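Putting the pieces together, the whole script with the header might look like the sketch below. The input files are not shown in the original, so the sample data here is entirely hypothetical: only the column positions used by the script (SNP in column 2 and MAF in column 5 of file2; CHR, SNP, TEST and P in columns 1, 2, 3 and 9 of file1) come from the answer, and the filler columns X4 to X8 are placeholders. For brevity the fields are printed with awk's default single-space separator rather than the four spaces used above.

```shell
tmp=$(mktemp -d)
# Hypothetical file2: header line, then SNP in column 2 and MAF in column 5.
printf '%s\n' 'CHR SNP A1 A2 MAF' \
    '1 rs1 A G 0.12' \
    '1 rs2 C T 0.30' > "$tmp/file2"
# Hypothetical file1: header line, TEST in column 3, P in column 9.
printf '%s\n' 'CHR SNP TEST X4 X5 X6 X7 X8 P' \
    '1 rs1 ALL . . . . . 0.04' \
    '1 rs1 ADD . . . . . 0.50' \
    '1 rs2 ALL . . . . . 0.91' \
    '1 rs3 ALL . . . . . 0.20' > "$tmp/file1"

result=$(awk 'BEGIN {print "CHR SNP MAF P"}               # header, printed once
     FNR==NR && FNR>1 {a[$2] = $5; next}                  # file2: SNP -> MAF
     FNR>1 && ($2 in a) && $3 == "ALL" {                  # file1: matching ALL rows
         print $1, $2, a[$2], $9
     }' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$result"
rm -r "$tmp"
```

With this sample data, the ADD row is skipped because of the TEST filter, and rs3 is skipped because it has no entry in file2, leaving the header plus the rs1 and rs2 lines.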

Merging two files according to a common column

Your files were created on Windows, so they have Windows-style line endings (\r\n). Remove the trailing \r from each line (the commands below assume GNU sed) and everything should work as you expect:

sed -i 's/\r$//' File1
sed -i 's/\r$//' File2
awk 'FNR==NR{a[$4]=$5;next} {print $1,$2,$3,$4,a[$4]}' File2 File1 > file3
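To check whether a file really contains carriage returns before editing it in place, something along these lines can help. This is a sketch on a throwaway file (`dos.txt` and `unix.txt` are made-up names); it uses tr rather than sed -i, so it writes a cleaned copy instead of modifying the original.

```shell
tmp=$(mktemp -d)
# A two-line sample with Windows line endings.
printf 'a b\r\nc d\r\n' > "$tmp/dos.txt"

cr=$(printf '\r')                               # a literal carriage return
before=$(grep -c "$cr" "$tmp/dos.txt")          # number of lines containing \r

# Delete every carriage return while copying to a new file.
tr -d '\r' < "$tmp/dos.txt" > "$tmp/unix.txt"
# grep -c exits non-zero when the count is 0, hence the "|| true".
after=$(grep -c "$cr" "$tmp/unix.txt" || true)

echo "lines with CR before: $before, after: $after"
rm -r "$tmp"
```

A non-zero count for File1 or File2 would confirm the diagnosis above: the stray \r becomes part of the last field on each line, so otherwise identical keys fail to match.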