How to Join Multiple Files Using Text Processing

join;text processing

I have multiple .txt files and would like to merge them together base on their first column (numeric), and fill in "NULL" if there's any missing data.

File_1:

1 a
2 b
3 c

File_2:

3 c
4 d
5 e

File_3:

4 d
5 e
6 f

Expected_Output:

1 a NULL NULL
2 b NULL NULL
3 c c NULL
4 NULL d d
5 NULL e e
6 NULL NULL f

join -t $'\t' -a 1 -a 2 -1 1 -2 1 -e NULL -o 0,1.2,2.2 file_1 file_2 | join -t $'\t' -a 1 -a 2 -1 1 -2 1 -e NULL - file_3 > expected_output This command give me correct output of column 1, 2 and 3 ;however, "NULL" is missing in column 4, any idea how to fix it?
Also is there a better way to merge multiple files instead writing a super long pipe command?

Best Answer

You are almost there. Using your command, we get:

$ join -t $'\t' -a 1 -a 2 -1 1 -2 1 -e NULL -o 0,1.2,2.2 file_1 file_2 | join -t $'\t' -a 1 -a 2 -1 1 -2 1 -e NULL - file_3
1       a       NULL
2       b       NULL
3       c       c
4       NULL    d       d
5       NULL    e       e
6       f

Lines just don't have the same number of columns because we are not setting a format for the right-hand join in the pipeline.

If we add it as -o 0,1.2,1.3,2.2 (the join field + the second and third columns from the first join + the second column of file_3):

$ join -t $'\t' -a 1 -a 2 -1 1 -2 1 -e NULL -o 0,1.2,2.2 file_1 file_2 | join -t $'\t' -a 1 -a 2 -1 1 -2 1 -e NULL -o 0,1.2,1.3,2.2 - file_3
1       a       NULL    NULL
2       b       NULL    NULL
3       c       c       NULL
4       NULL    d       d
5       NULL    e       e
6       NULL    NULL    f

Finally, if we can assume the GNU implementation of join, we can let it do the job of inferring the right format and use -o auto instead of -o 0,1.2,2.2 and -o 0,1.2,1.3,2.2, provided that, for each file, all lines have at most the same number of fields as the first one. Quoting info join:

-o auto
If the keyword auto is specified, infer the output format from the first line in each file. This is the same as the default output format but also ensures the same number of fields are output for each line. Missing fields are replaced with the -e option and extra fields are discarded.

Related Question