Join two files based on a common field

join;text processing

I'm trying to join two files together based on their common field, and output it to a file, using the join command.

File 1:

C01:Nancy:Jones:njones@abc.com 
C02:Barbara:Madison:bmadison@bcd.com 
C03:Tim:Adams:tadams@cde.com
C04:Sarah:Moore:smoore@def.com
C05:John:Polk:jpolk@efg.com
C06:Paula:Jacobs:pjacobs@fgh.com

File 2:

R001:07/04/15:123.45:C01
R002:12/20/15:167.50:C03
R003:03/14/16:298.00:C06
R005:09/15/16:36.50:C03
R005:11/27/16:58.00:C02
R006:02/28/17:72.98:C05 

Expected Output

C01:Nancy:Jones:njones@abc.com:R001:07/04/15:123.45
C02:Barbara:Madison:bmadison@bcd.com:R005:11/27/16:58.00
C03:Tim:Adams:tadams@cde.com:R002:12/20/15:167.50
C03:Tim:Adams:tadams@cde.com:R004:09/15/16:36.5
C05:John:Polk:jpolk@efg.com:R006:02/28/17:72.9
C06:Paula:Jacobs:pjacobs@fgh.com:R003:03/14/16:298.00

I've tried sorting File2 based on field 4 first and then outputting it to a new file

sort -t: -k 4 File2 > File22  

then joining

join -t: -1 1 -2 4 -o '1.1 1.2 1.3 1.4 2.1 2.2 2.3' File1 File22 > File 3  

but I always get a blank output file. I'm also not sure how to handle the omission of row 4 from File1 and the fact that C03 comes up twice in File2.

Best Answer

There are two things you need to do to get things working the way you want. First of all, by appending > File 3 to the end of the command, you are redirecting the output to File and also adding the extra argument 3 to the join command.

To explain why this happens, I recently (and coincidentally!) answered a similar question at Strange behaviour redirecting man pages. Basically, it has to do with Bash grammar, which you can read more about in man bash.

To fix that behaviour, you should either ensure there are no spaces in the output file, or you should escape the space, or you should surround the filename with quotation marks. Any of the following redirections would work:

... > File3
... > File\ 3
... > "File 3"

Now to address the missing records from the output. This is handled by the -a flag. From man join:

-a FILENUM
       also print unpairable lines from file FILENUM, where FILENUM is 1 or 2, corresponding to FILE1 or FILE2

Putting it all together, the following command works for me (I will use tee rather than a redirection to show the output):

$ join -t: -1 1 -2 4 -o '1.1 1.2 1.3 1.4 2.1 2.2 2.3' -a 1 File1 <(sort -t: -k4 File2) | tee "File 3"
C01:Nancy:Jones:njones@abc.com :R001:07/04/15:123.45
C02:Barbara:Madison:bmadison@bcd.com :R005:11/27/16:58.00
C03:Tim:Adams:tadams@cde.com:R002:12/20/15:167.50
C03:Tim:Adams:tadams@cde.com:R005:09/15/16:36.50
C04:Sarah:Moore:smoore@def.com:::
C05:John:Polk:jpolk@efg.com:::
C06:Paula:Jacobs:pjacobs@fgh.com:R003:03/14/16:298.00
Related Question