I have asked similar questions here a couple times in the past with great success, but now my needs have slightly changed and I am struggling to get the exact output I am looking for.
I would like to compare 2 similar delimited files, but they will have different numbers of rows and may contain duplicates. The files will have identical headers.
file1.txt
mem_id date time building
aa1 bb1 cc1 dd1
aa2 bb2 cc2 dd2
aa3 bb3 ccx3 dd3
aa4 bb4 cc4 dd4
aa5 bb5 cc5 dd5
file2.txt
mem_id date time building
aa1 bby1 cc1 ddy1
aa2 bb2 cc2 dd2
aa3 bb3 cc3 dd3
aa4 bb4 cc4 dd4
aa4 bb4a cc4a dd4a
You will see there are 4 differences:
1- File2, mem_id aa1 has a “y” in both the "date" and "building" column
2- File1, mem_id aa3 has an “x” in "time" column
3- File1, has a mem_id aa5
4- File2, mem_id aa4 has 2 entries
I would like to run a script to output only the differences between the 2 files (skipping identical lines). Everything I have tried gets hung up on the duplicate or skipped lines, which throws off the output for the rest of the file. If all lines match up one-to-one, the following code works well:
current_code
awk -F ',' 'BEGIN {IGNORECASE = 1} NR==1 {for (i=1; i<=NF; i++) header[i] = $i}NR==FNR {for (i=1; i<=NF; i++) {A[i,NR] = $i} next}{ for (i=1; i<=NF; i++) if (A[i,FNR] != $i) print header[1]"#-"$1": " header[i] "- " ARGV[1] " value= ", A[i,FNR]" / " ARGV[2] " value= "$i}'
desired_output.txt
Mem_id#-aa1 : date- file1.txt value = bb1 / file2.txt value= bby1
Mem_id#-aa1 : building- file1.txt value = dd1 / file2.txt value= ddy1
Mem_id#-aa3 : time- file1.txt value = ccx3 / file2.txt value= dd3
Mem_id#-aa4 : date- file1.txt value = / file2.txt value= bb4a
Mem_id#-aa4 : time- file1.txt value = / file2.txt value= cc4a
Mem_id#-aa4 : building- file1.txt value = / file2.txt value= dd4a
Mem_id#-aa5 : date- file1.txt value = bb5 / file2.txt value=
Mem_id#-aa5 : time- file1.txt value = cc5 / file2.txt value=
Mem_id#-aa5 : building- file1.txt value = dd5 / file2.txt value=
Best Answer
The following Python program should do what you want, or something very close to it.
In desired_output.txt, the 3rd line seems to be erroneous: the dd3 should probably be cc3. Apart from that, the output from the program matches, except for whitespace, which seems a bit irregular in your sample output.
The input is assumed to be ordered by key (mem_id).
The sample input does not make clear what behaviour is expected when both the first and the second file contain the same mem_id twice (or more).
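As a rough illustration of why the ordering assumption matters, here is a minimal sketch (not the original program) that groups rows by key. The helper name group_by_key is hypothetical, and the whitespace delimiter is assumed from the sample files. Because itertools.groupby only merges adjacent rows, duplicates of a key end up in one group only when the input is sorted.

```python
from itertools import groupby


def group_by_key(lines):
    """Group data rows by their first column (mem_id).

    Assumes rows are already sorted by key, so that rows sharing
    a key are adjacent (groupby only merges neighbours).
    """
    rows = [line.split() for line in lines if line.strip()]
    return [(key, list(grp)) for key, grp in groupby(rows, key=lambda r: r[0])]


# Data rows of file2.txt from the question (header omitted).
file2_rows = [
    "aa1 bby1 cc1 ddy1",
    "aa2 bb2 cc2 dd2",
    "aa3 bb3 cc3 dd3",
    "aa4 bb4 cc4 dd4",
    "aa4 bb4a cc4a dd4a",
]
groups = group_by_key(file2_rows)
```

Here the two aa4 rows land in a single group, which is what lets them be compared together against the single aa4 row of file1.txt.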
In output(), I try to match any rows and pop all matching ones (from both left and right). Therefore, the order of matching lines within the same mem_id is not important. If left, right, or both are empty afterwards, printing is easy (especially when both are empty). For the rest, I match each remaining line from the left to the right.
The fmt string in line_out() determines the output; you can freely change or reorder that.
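The matching step described above might look roughly like the sketch below. This is a reconstruction under assumptions, not the original program: the names output(), line_out() and fmt come from the description, but their signatures, the names parameter, and the choice to pair leftover rows by position are guesses. The caller is assumed to pass an empty list for a key that is missing from one file.

```python
def line_out(header, key, col, left_val, right_val,
             names=("file1.txt", "file2.txt")):
    # fmt controls the layout of each reported difference;
    # change or reorder the fields freely.
    fmt = "{h}#-{k} : {c}- {n1} value = {l} / {n2} value= {r}"
    return fmt.format(h=header[0], k=key, c=header[col],
                      n1=names[0], l=left_val, n2=names[1], r=right_val)


def output(header, key, left, right):
    """Return the difference lines for all rows sharing one key."""
    left, right = list(left), list(right)
    # Pop every row that appears identically on both sides.
    for row in list(left):
        if row in right:
            left.remove(row)
            right.remove(row)
    lines = []
    blank = [""] * len(header)
    # Pair the leftover rows by position; a missing side is all blanks.
    for i in range(max(len(left), len(right))):
        l_row = left[i] if i < len(left) else blank
        r_row = right[i] if i < len(right) else blank
        for col in range(1, len(header)):
            if l_row[col] != r_row[col]:
                lines.append(line_out(header, key, col, l_row[col], r_row[col]))
    return lines


header = ["mem_id", "date", "time", "building"]
aa1 = output(header, "aa1",
             [["aa1", "bb1", "cc1", "dd1"]],
             [["aa1", "bby1", "cc1", "ddy1"]])
aa4 = output(header, "aa4",
             [["aa4", "bb4", "cc4", "dd4"]],
             [["aa4", "bb4", "cc4", "dd4"], ["aa4", "bb4a", "cc4a", "dd4a"]])
print("\n".join(aa1 + aa4))
```

For aa4, the identical row is popped from both sides first, so only the extra file2.txt row is reported, with blank values on the file1.txt side. The header is printed as it appears in the file (mem_id, not Mem_id), so capitalisation and spacing differ slightly from the sample output.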