Ubuntu – Awk – compare 2 files and print columns from both files

awk bash command-line text-processing

I posted something similar a while ago and thought the code provided there could help solve my problem. Unfortunately, I am not able to adjust it to my needs: awk – compare files and print lines from both files

So, again I have 2 tab-separated files.

file_1.txt

apple    2.5    5     7.2
great    3.8    10    3.6
see      7.6    3     4.9
tree     5.4    11    5
back     8.9    2     2.1

file_2.txt

apple    :::N
back     :::ADJ
back     :::N      
around   :::ADV      
great    :::ADJ         
bee      :::N         
see      :::V      
tree     :::N         

The output should look like:

apple    :::N      2.5    5     7.2     
great    :::ADJ    3.8    10    3.6
back     :::ADJ    8.9    2     2.1
back     :::N      8.9    2     2.1
see      :::V      7.6    3     4.9
tree     :::N      5.4    11    5 

The difference from the other post is that I only want to compare the first columns of file_1.txt and file_2.txt, and then print the whole line of file_1.txt together with column 2 of file_2.txt to the outfile. I do not care in which position $2 of file_2.txt is printed, so the outfile could just as well look like

back     8.9    2     2.1    :::N
back     8.9    2     2.1    :::ADJ etc.

The problem is the duplicates in column 1, such as back here. Otherwise I could of course just use paste.
The problem with this awk command is that it does not store column 2 of file_2.txt in the array a, so it cannot print it when I ask for it.

awk 'NR==FNR {a[$1]; next} $1 in a {print $0, a[$2]}' OFS='\t' file_2.txt file_1.txt > outfile.txt

I would appreciate any help! Sorry for the stupidity here; it seems that I am completely stumped.

Best Answer

If you have GNU awk (available from the repository via package gawk), which supports true multi-dimensional arrays (arrays of arrays, gawk 4.0 and later), you could do

gawk 'NR==FNR {a[$1][$2]++; next} $1 in a {for (x in a[$1]) print $0, x}' OFS="\t" file_2.txt file_1.txt

Example:

$ gawk 'NR==FNR {a[$1][$2]++; next} $1 in a {for (x in a[$1]) print $0, x}' OFS="\t" file_2.txt file_1.txt
apple   2.5     5       7.2     :::N
great   3.8     10      3.6     :::ADJ
see     7.6     3       4.9     :::V
tree    5.4     11      5       :::N
back    8.9     2       2.1     :::ADJ
back    8.9     2       2.1     :::N

Otherwise, if output order is not important, the easiest solution is probably to use the join command instead (join expects both inputs to be sorted on the join field, hence the sort calls):

$ join -t $'\t' <(sort file_1.txt) <(sort file_2.txt)
apple   2.5     5       7.2     :::N
back    8.9     2       2.1     :::ADJ
back    8.9     2       2.1     :::N
great   3.8     10      3.6     :::ADJ
see     7.6     3       4.9     :::V
tree    5.4     11      5       :::N
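
Since join relies on both inputs being sorted the same way on the key field, a slightly more defensive variant may be worth considering when keys contain unusual characters or the locale is not C. This is a sketch, not a tested recipe for every data set: the scratch directory, the intermediate file names f1.sorted and f2.sorted, and the variables tab and joined are hypothetical, and the sample data is a cleaned-up copy without trailing whitespace.

```shell
# Hypothetical scratch directory so the sketch does not touch real files.
tmp=$(mktemp -d)
printf 'apple\t2.5\t5\t7.2\ngreat\t3.8\t10\t3.6\nsee\t7.6\t3\t4.9\ntree\t5.4\t11\t5\nback\t8.9\t2\t2.1\n' > "$tmp/file_1.txt"
printf 'apple\t:::N\nback\t:::ADJ\nback\t:::N\naround\t:::ADV\ngreat\t:::ADJ\nbee\t:::N\nsee\t:::V\ntree\t:::N\n' > "$tmp/file_2.txt"

# A literal tab that also works in shells without the $'\t' syntax.
tab=$(printf '\t')

# Sort on the first field only, using the same C collation for sort and join.
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/file_1.txt" > "$tmp/f1.sorted"
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/file_2.txt" > "$tmp/f2.sorted"

# Join on field 1 of both files (the default); unmatched keys are dropped.
joined=$(LC_ALL=C join -t "$tab" "$tmp/f1.sorted" "$tmp/f2.sorted")

printf '%s\n' "$joined"
rm -r "$tmp"
```

Keys that occur in only one file (around and bee here) are silently omitted, which matches the output wanted in the question.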