Joining entries based off of column using awk/join

awkjoin;text processing

I have two files which are pipe-delimited and may have column 1+column2 matches in both, or one file may have the entry while the other does not. Assume my match-key I am going off of equals $1"-"$2 using a pipe '|' as the FS.

file1

1111|AAA|foo|50
1111|BBB|foo|30
2222|BBB|foo|10

file2

1111|AAA|bar|10
1111|CCC|bar|20
3333|AAA|bar|40

The desired output would be the following for the first entry (I have this working)

1111|AAA|50|10

For the second entry file1 (If there is no matching column1+column2 in both files, replace the entry which is missing for foo as 0. And the other way around)

1111|BBB|30|0

And for an entry key (column1+column2) in file2, but not in file1 (This is entry 3 of file 2 expected output)

3333|AAA|0|40

So, desired output overall format is listing ALL unique keys which are represented by column1+column2 in BOTH files. With the 3rd column entries being those values from file1 column 4 (or 0 if value doesn't exist in file1) and the 4th column in output as those values in column 4 of file 2 (or 0 if value doesn't exist in file2).

I have done a lot of research and tried many things but I have values not outputting if the column1+column2 pair exists in file2 but not file1 by using the following:

join -t"|" -e0 -a1 -a2 -o 1.2,1.3,1.5,2.5 <(<file1 awk -F"|" '{print $1"-"$2"|"$0}' | sort -k1,1) <(<file2 awk -F"|" '{print $1"-"$2"|"$0}' | sort -k1,1)

The above case gives me expected output if there is a column1+column2 match in file1 but not file2, and appends a 0 for the match not existing… How can I get this to work for ALL scenarios?

Above command will do some process substitution by adding a key in column 1 in both files which is column1+column2, and then join based off of that new key. -e0 will add a 0 if this key exists in file1 but not file2. How can I get it to cover the case of: New key (column1-column2) exists in file 2 but NOT file 1?

Best Answer

With your approach you have to use join twice (or change your approach to do it with a single join invocation) :

  • print the common lines and the unpairable lines from file1 with join -t'|' -e0 -a1 -o 1.2,1.3,1.5,2.5 <(<file1 awk -F'|' '{print $1"-"$2"|"$0}' | sort -t'|' -k1,1) <(<file2 awk -F'|' '{print $1"-"$2"|"$0}' | sort -t'|' -k1,1)
  • print the unpairable lines from file2 with join -t'|' -e0 -v2 -o 2.2,2.3,1.5,2.5 <(<file1 awk -F'|' '{print $1"-"$2"|"$0}' | sort -t'|' -k1,1) <(<file2 awk -F'|' '{print $1"-"$2"|"$0}' | sort -t'|' -k1,1)

You can do the same with a single awk invocation, storing $4 in two arrays indexed by e.g. $1|$2 and then in the END block iterating over each array indices, comparing them and printing accordingly:

awk -F'|' 'NR==FNR{z[$1"|"$2]=$4;next}{x[$1"|"$2]=$4}
END{for (j in x){if (!(j in z)){print j, "0", x[j]}};
for (i in z){if (i in x){print i, z[i], x[i]} else {print i, z[i], "0"}}
}' OFS="|"  file1 file2
Related Question