How to Join Multiple Files Using Text Processing

join;text processing

I have multiple .txt files and would like to merge them together base on their first column (numeric), and fill in "NULL" if there's any missing data.

File_1:

1 a
2 b
3 c

File_2:

3 c
4 d
5 e

File_3:

4 d
5 e
6 f

Expected_Output:

1 a NULL NULL
2 b NULL NULL
3 c c NULL
4 NULL d d
5 NULL e e
6 NULL NULL f

join -t $'\t' -a 1 -a 2 -1 1 -2 1 -e NULL -o 0,1.2,2.2 file_1 file_2 | join -t $'\t' -a 1 -a 2 -1 1 -2 1 -e NULL - file_3 > expected_output This command give me correct output of column 1, 2 and 3 ;however, "NULL" is missing in column 4, any idea how to fix it?
Also is there a better way to merge multiple files instead writing a super long pipe command?

Best Answer

You are almost there. Using your command, we get:

$ join -t $'\t' -a 1 -a 2 -1 1 -2 1 -e NULL -o 0,1.2,2.2 file_1 file_2 | join -t $'\t' -a 1 -a 2 -1 1 -2 1 -e NULL - file_3
1       a       NULL
2       b       NULL
3       c       c
4       NULL    d       d
5       NULL    e       e
6       f

Lines just don't have the same number of columns because we are not setting a format for the right-hand join in the pipeline.

If we add it as -o 0,1.2,1.3,2.2 (the join field + the second and third columns from the first join + the second column of file_3):

$ join -t $'\t' -a 1 -a 2 -1 1 -2 1 -e NULL -o 0,1.2,2.2 file_1 file_2 | join -t $'\t' -a 1 -a 2 -1 1 -2 1 -e NULL -o 0,1.2,1.3,2.2 - file_3
1       a       NULL    NULL
2       b       NULL    NULL
3       c       c       NULL
4       NULL    d       d
5       NULL    e       e
6       NULL    NULL    f

Finally, if we can assume the GNU implementation of join, we can let it do the job of inferring the right format and use -o auto instead of -o 0,1.2,2.2 and -o 0,1.2,1.3,2.2, provided that, for each file, all lines have at most the same number of fields as the first one. Quoting info join:

-o auto
If the keyword auto is specified, infer the output format from the first line in each file. This is the same as the default output format but also ensures the same number of fields are output for each line. Missing fields are replaced with the -e option and extra fields are discarded.

Related Solutions

Bash – Get lines matching a pattern in one file and put them into a second file matching the same pattern

With GNU sed:

sed -e '/^b/{R 1.txt' -e 'd}' 2.txt

if you want to edit file 2.txt "in place", add sed's option -i.

How to Merge Multiple Files with Join Command

Basically like this for your 3 files example

$ join file2 file3| join file1 -
1 test1 example1 foo1
2 test3 example2 foo2
3 test4 example3 foo3
4 test5 example4 foo4

But important all your input files must be sorted already (sort -k 1b,1, numerical sorted like your example may not work!). So the example above sorted on-the-fly could be written in bash like this:

join <(sort -k 1b,1 file2) <(sort -k 1b,1 file3) | join <(sort -k 1b,1 file1) -\
  | sort -k 1n,1

And finally the generic case for n files using a recursive function (tested in bash).:

xjoin() {
    local f
    local srt="sort -k 1b,1"

    if [ "$#" -lt 2 ]; then
            echo "xjoin: need at least 2 files" >&2
            return 1
    elif [ "$#" -lt 3 ]; then
            join <($srt "$1") <($srt "$2")
    else
            f=$1
            shift
            join <($srt "$f") <(xjoin "$@")
    fi
}

xjoin file1 file2 file3 | sort -k 1n,1

If you know what you are doing you may omit the sort pipes. But from my experience join without explicit sort is very often the cause of trouble.

Best Answer

Related Solutions

Bash – Get lines matching a pattern in one file and put them into a second file matching the same pattern

How to Merge Multiple Files with Join Command

Related Question