Bash – Merge and print matching and non matching values between a smaller file and a huge file

bashjoin;text processing

I have a relatively small space delimited file which looks like this (but with many columns and rows):

file1:

Entry1
a
b
c
d

And another huge tab delimited file with many duplicated entries per row like this:

file2:

value       ID1   ID2   
1      a    aaaa1
1      a    aaaa2
1      b    bbbb1
1      b    bbbb2
1      b    bbbb3
1      d    aaaa4

I am trying to first print a tab delimited file of the same size as file1 specifying matches between each entry in file1 and ID1 field (column 2) in file2 like this this is a separate question:

Entry1
a
b
NoMatch
d

And also print a merged tab delimited file with the values of file1 and file2, this time with repeated entries if there are in file2, but keeping the NoMatch as well, like this:

value       ID1   ID2   
1      a    aaaa1
1      a    aaaa2
1      b    bbbb1
1      b    bbbb2
1      b    bbbb3
NoMatch NoMatch NoMatch
1      d    aaaa4

I have tried using the join command to give me a NoMatch message:

join -a1 -e "NoMatch" <(awk -F ' ' '{print $0}' file1.txt |tail -n +2|sort -k1,1) <(cat file2 | sort -k1,1) > out.txt

but I must be doing something wrong because this prints all of file1 for example, but if there is no match in file2 the rest of the fields are blank (so no message "NoMatch" appears), can you please help me understand what I am doing wrong?

Thanks so much!

#

Thank you gilles, I can't seem to answer your comment as I just registered…
What you suggested:

join -a1 -e "NoMatch" -11 -22 --header -o2.1,2.2,2.3 file1 file2

solved the second query Output2, thank you! Can I ask how I can get the first Output1, with the same lines as the file1? Thanks again!!

With don_crissti's help, I can get the second output like this:

paste -d'\t' file2 <(awk 'FNR==NR{seen[$1]++; next}  {(FNR==1 || ($1 in seen)) || $1="NoMatch"};1'  file2  file1)

Best Answer

Reformat the original example for readability:

join -a1 -e "NoMatch" \
 <(awk -F ' ' '{print $0}' file1.txt |tail -n +2|sort -k1,1) \
 <(cat file2 | sort -k1,1)

All of the problems are with file2 processing and with the join options.

  1. file2 is sorted on field 1, should be field 2.
  2. No -t '<tab>' option for join, file2 is tab delimited.
  3. Missing join option to join on field 2 of file2.
  4. Even with the correct join field for file2, the default output prints the join field first, so you have to specify the -o FORMAT option on join.
  5. file2 header is not removed.
  6. Not causing a problem, but unneeded processing: tail is unneeded on file1 because awk can skip the first line.
  7. Unneeded -F ' ' option to awk.

Correcting those issues, yields this:

#!/bin/bash
head -1 file2
join -t '   ' -2 2 -a 1 -e NoMatch -o 2.1,2.2,2.3 \
 <(awk 'NR==1{next} {print $0}' file1.txt | sort) \
 <(tail -n +2 file2 | sort -k2)

Which yields the following tab-delimited output:

value   ID1     ID2
1       a       aaaa1
1       a       aaaa2
1       b       bbbb1
1       b       bbbb2
1       b       bbbb3
NoMatch NoMatch NoMatch
1       d       aaaa4
Related Question