I have a relatively small space delimited file which looks like this (but with many columns and rows):
file1:
Entry1
a
b
c
d
And another huge tab delimited file with many duplicated entries per row like this:
file2:
value ID1 ID2
1 a aaaa1
1 a aaaa2
1 b bbbb1
1 b bbbb2
1 b bbbb3
1 d aaaa4
I am trying to first print a tab delimited file of the same size as file1 specifying matches between each entry in file1 and ID1 field (column 2) in file2 like this this is a separate question:
Entry1
a
b
NoMatch
d
And also print a merged tab delimited file with the values of file1 and file2, this time with repeated entries if there are in file2, but keeping the NoMatch as well, like this:
value ID1 ID2
1 a aaaa1
1 a aaaa2
1 b bbbb1
1 b bbbb2
1 b bbbb3
NoMatch NoMatch NoMatch
1 d aaaa4
I have tried using the join command to give me a NoMatch message:
join -a1 -e "NoMatch" <(awk -F ' ' '{print $0}' file1.txt |tail -n +2|sort -k1,1) <(cat file2 | sort -k1,1) > out.txt
but I must be doing something wrong because this prints all of file1 for example, but if there is no match in file2 the rest of the fields are blank (so no message "NoMatch" appears), can you please help me understand what I am doing wrong?
Thanks so much!
#
Thank you gilles, I can't seem to answer your comment as I just registered…
What you suggested:
join -a1 -e "NoMatch" -11 -22 --header -o2.1,2.2,2.3 file1 file2
solved the second query Output2, thank you! Can I ask how I can get the first Output1, with the same lines as the file1? Thanks again!!
With don_crissti's help, I can get the second output like this:
paste -d'\t' file2 <(awk 'FNR==NR{seen[$1]++; next} {(FNR==1 || ($1 in seen)) || $1="NoMatch"};1' file2 file1)
Best Answer
Reformat the original example for readability:
All of the problems are with file2 processing and with the
join
options.-t '<tab>'
option forjoin
, file2 is tab delimited.join
option to join on field 2 of file2.join
field for file2, the default output prints the join field first, so you have to specify the-o FORMAT
option onjoin
.tail
is unneeded on file1 becauseawk
can skip the first line.-F ' '
option toawk
.Correcting those issues, yields this:
Which yields the following tab-delimited output: