With help from this answer
awk 'FNR==NR && FNR>1 {a[$2] = $5; next}
FNR > 1 && ($2 in a) && $3 == "ALL" {
print $1 " " $2 " " a[$2] " " $9
}' file2 file1
To get the header as well, just add this to the beginning of the script:
BEGIN{print "CHR SNP MAF P"}
Explanation:
First of all, when two files are passed to awk, they are processed one after another. There are two variables important here: NR
is the line number from the beginning of the awk
command, and FNR
is the line number from the beginning of the current file. That is, when the first file is processed (here file2), NR
and FNR
have the same value, which is the value of the line currently processed. But when awk pass to second file, FNR
is reset to 1, so that NR
and FNR
are no longer the same. So that the test FNR==NR
is a trick for knowing if the file processed is the first or not.
So let's see the code. The condition FNR==NR && FNR>1
tests whether we are processing the first file and not the first line. If it's the case, we store the value of fifth column (MAF
) in an array indexed by the second one (SNP
) and then the next
statement says to pass to the following line.
When awk processes the second file (which is file1), the first test is false, so that awk tries the second test: FNR > 1 && ($2 in a) && $3 == "ALL"
, that is: not the first line of the file + second column value (SNP
) exists in table a
+ third column value (TEST
) is "ALL"
. If it is the case, then it prints column 1 (CHR
) and two (SNP
), gets the MAF
value from the array with a[$2]
, and then prints column nine (P
).
Adding a BEGIN{...}
statement at the beginning adds a command which is run only before the first line is processed.
Your files were created in Windows so they have Windows style line-endings (\r\n
). Remove the \r
and everything should work as you expect:
sed -i 's/\r//' File1
sed -i 's/\r//' File2
awk 'FNR==NR{a[$4]=$5;next} {print $1,$2,$3,$4,a[$4]}' File2 File1 > file3
Best Answer
This should do it:
Important: this assumes your files are sorted (as in your example) according to the SNP name. If they are not, sort them first:
Output:
Explanation (from
info join
):So, the command above joins the files on the second field and prints the 1st,2nd and 3rd field of file one, followed by the 3rd field of file2.