UNIX paste columns and insert zeros for all missing values

awkpastetext processing

I would like to merge specific columns from two txt files containing varying number of rows, but same number of columns (as shown below):

  file1:
  xyz   desc1   12
  uvw   desc2   55
  pqr   desc3   12

  file2:
  xyz   desc1   56
  uvw   desc2   88


  Preferred output:
  xyz   desc1   12  56
  uvw   desc2   55  88
  pqr   desc3   12   0

Currently I use the paste command using awk as:

  paste <(awk '{print $1}' file1) <(awk '{print $2}' file1) <(awk '{print $3}' file1) <(awk '{print $3}' file2) 

But this seems to merge only columns that overlap. Is there a way in awk to insert zeros instead of omitting the row itself?

I need to combine 100 files together such that my output file will contain 102 columns.

Best Answer

If column-order is important, i.e. numbers from the same file should be kept in the same column, you need to add padding while reading the different files. Here is one way that works with GNU awk:

merge.awk

# Set k to be a shorthand for the key
{ k = $1 SUBSEP $2 }

# First element with this key, add zeros to align it with other rows
!(k in h) {
  for(i=1; i<=ARGIND-1; i++)
    h[k] = h[k] OFS 0 
}

# Remember the data element
{ h[k] = h[k] OFS $3 }

# Before moving to the next file, ensure that all rows are aligned
ENDFILE {
  for(k in h) {
    if(split(h[k], a) < ARGIND)
      h[k] = h[k] OFS 0
  }
}

# Print out the collected data
END {
  for(k in h) {
    split(k, a, SUBSEP)
    print a[1], a[2], h[k]
  }
}

Here are some test files: f1, f2, f3 and f4:

$ tail -n+1 f[1-4]
==> f1 <==
xyz desc1 21
uvw desc2 22
pqr desc3 23

==> f2 <==
xyz desc1 56
uvw desc2 57

==> f3 <==
xyz desc1 87
uvw desc2 88

==> f4 <==
xyz desc1 11
uvw desc2 12
pqr desc3 13
stw desc1 14
arg desc2 15

Test 1

awk -f merge.awk f[1-4] | column -t

Output:

pqr  desc3  23  0   0   13
uvw  desc2  22  57  88  12
stw  desc1  0   0   0   14
arg  desc2  0   0   0   15
xyz  desc1  21  56  87  11

Test 2

awk -f merge.awk f2 f3 f4 f1 | column -t

Output:

pqr  desc3  0   0   13  23
uvw  desc2  57  88  12  22
stw  desc1  0   0   14  0
arg  desc2  0   0   15  0
xyz  desc1  56  87  11  21

Edit:

If the output should be tab-separated, set the output field separator accordingly:

awk -f merge.awk OFS='\t' f[1-4]
Related Question