UNIX paste columns and insert zeros for all missing values

awk, paste, text-processing

I would like to merge specific columns from two text files that have different numbers of rows but the same number of columns (as shown below):

  file1:
  xyz   desc1   12
  uvw   desc2   55
  pqr   desc3   12

  file2:
  xyz   desc1   56
  uvw   desc2   88


  Preferred output:
  xyz   desc1   12  56
  uvw   desc2   55  88
  pqr   desc3   12   0

Currently I combine paste with awk as follows:

  paste <(awk '{print $1}' file1) <(awk '{print $2}' file1) <(awk '{print $3}' file1) <(awk '{print $3}' file2)

But this seems to merge only columns that overlap. Is there a way in awk to insert zeros instead of omitting the row itself?

I need to combine 100 files together such that my output file will contain 102 columns.

Best Answer

If column order is important, i.e. numbers from the same file should stay in the same column, you need to add padding while reading the different files. Here is one way that requires GNU awk, because it relies on the gawk-specific ARGIND variable and ENDFILE pattern:

merge.awk

# Set k to be a shorthand for the key
{ k = $1 SUBSEP $2 }

# First element with this key, add zeros to align it with other rows
!(k in h) {
  for(i=1; i<=ARGIND-1; i++)
    h[k] = h[k] OFS 0 
}

# Remember the data element
{ h[k] = h[k] OFS $3 }

# Before moving to the next file, ensure that all rows are aligned
ENDFILE {
  for(k in h) {
    if(split(h[k], a) < ARGIND)
      h[k] = h[k] OFS 0
  }
}

# Print out the collected data
END {
  for(k in h) {
    split(k, a, SUBSEP)
    # h[k] already begins with OFS, so append it directly to avoid an empty field
    print a[1], a[2] h[k]
  }
}

Here are some test files: f1, f2, f3 and f4:

$ tail -n+1 f[1-4]
==> f1 <==
xyz desc1 21
uvw desc2 22
pqr desc3 23

==> f2 <==
xyz desc1 56
uvw desc2 57

==> f3 <==
xyz desc1 87
uvw desc2 88

==> f4 <==
xyz desc1 11
uvw desc2 12
pqr desc3 13
stw desc1 14
arg desc2 15

Test 1

awk -f merge.awk f[1-4] | column -t

Output:

pqr  desc3  23  0   0   13
uvw  desc2  22  57  88  12
stw  desc1  0   0   0   14
arg  desc2  0   0   0   15
xyz  desc1  21  56  87  11

Test 2

awk -f merge.awk f2 f3 f4 f1 | column -t

Output:

pqr  desc3  0   0   13  23
uvw  desc2  57  88  12  22
stw  desc1  0   0   14  0
arg  desc2  0   0   15  0
xyz  desc1  56  87  11  21

Edit:

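If the output should be tab-separated, set the output field separator with -v. An OFS='\t' assignment placed among the file arguments would occupy a slot in ARGV and shift ARGIND by one, which adds a spurious column of zeros:

awk -v OFS='\t' -f merge.awk f[1-4]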
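
The script above depends on gawk. For other awk implementations, the same padding idea can be sketched portably by counting files with FNR == 1 and filling in zeros at print time. This is a minimal sketch, not the answer's method: the names nfiles, seen, order and val are made up for illustration, and the two input files are small made-up samples modeled on the test files. One assumption to note: an empty input file never triggers FNR == 1, so it would not get a column.

```shell
#!/bin/sh
# Made-up sample inputs modeled on the test files above (assumption).
dir=$(mktemp -d)
printf 'xyz desc1 21\nuvw desc2 22\npqr desc3 23\n' > "$dir/f1"
printf 'xyz desc1 56\nuvw desc2 57\nstw desc1 14\n' > "$dir/f2"

merged=$(awk '
  # FNR restarts at 1 for each input file, so this counts non-empty files.
  FNR == 1 { nfiles++ }
  {
    k = $1 OFS $2
    # Remember keys in first-seen order so the output order is predictable.
    if (!(k in seen)) { seen[k] = 1; order[++nkeys] = k }
    val[k, nfiles] = $3
  }
  END {
    for (i = 1; i <= nkeys; i++) {
      line = order[i]
      # One column per file; 0 where the key was absent from that file.
      for (f = 1; f <= nfiles; f++)
        line = line OFS (((order[i], f) in val) ? val[order[i], f] : 0)
      print line
    }
  }
' "$dir/f1" "$dir/f2")

printf '%s\n' "$merged"
rm -rf "$dir"
```

With these two sample files the script prints xyz desc1 21 56, uvw desc2 22 57, pqr desc3 23 0 and stw desc1 0 14, one per line. Unlike the for (k in h) loop, whose iteration order is unspecified, the rows come out in the order the keys were first seen.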

Related Solutions

Merging 2 files based on field match

$ awk 'FNR==NR{a[$1]=$2;next} ($1 in a) {print $1,a[$1],$2}' file2 file1
aa 45 32
bb 31 15
cc 50 78

Explanation:

awk implicitly loops through each file, one line at a time. Since we gave it file2 as the first argument, it is read first. file1 is read second.

FNR==NR{a[$1]=$2;next}

NR is the number of lines that awk has read so far and FNR is the number of lines that awk has read so far from the current file. Thus, if FNR==NR, we are still reading the first named file: file2. For every line in file2, we assign a[$1]=$2.

Here, a is an associative array and a[$1]=$2 means saving file2's second column, denoted $2, as a value in array a using file2's first column, $1, as the key.

next tells awk to skip the rest of the commands and start over with the next line.

($1 in a) {print $1,a[$1],$2}

If we get here, that means that we are reading the second file: file1. If we saw the first field of the line in file2, as determined by the contents of array a, then we print out a line with the values of field 2 from both files.
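
To try the FNR==NR idiom end to end, here is a small self-contained run. The original input files are not shown, so the contents of file1 and file2 below are assumptions chosen to reproduce the output above, plus one extra key (dd) that has no match.

```shell
#!/bin/sh
dir=$(mktemp -d)
# Assumed inputs: not taken from the original post.
printf 'aa 32\nbb 15\ncc 78\ndd 99\n' > "$dir/file1"
printf 'aa 45\nbb 31\ncc 50\n' > "$dir/file2"

joined=$(awk '
  FNR == NR { a[$1] = $2; next }     # first file (file2): store column 2 under its key
  ($1 in a) { print $1, a[$1], $2 }  # second file (file1): print only keys seen in file2
' "$dir/file2" "$dir/file1")

printf '%s\n' "$joined"
rm -rf "$dir"
```

This prints aa 45 32, bb 31 15 and cc 50 78; the dd line is dropped because its key never appears in file2. Note that the idiom assumes the first file is non-empty: if it is empty, FNR==NR stays true while the second file is read.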

Adjust gap between 2 columns to make them look straight

awk 'FNR==1{f+=1;w++;}
     f==1{if(length>w) w=length; next;}
     f==2{printf("%-"w"s",$0); getline<f2; print;}
    ' f2=file2 file1 file1

Note: file1 is quite intentionally read twice: the first pass finds the maximum line length, and the second pass formats each line for the final concatenation with the corresponding line from file2. file2 is read programmatically with getline; its name is passed in through awk's variable-assignment-as-argument feature (f2=file2).

Output:

hi             1
wonderful      2
amazing        3
sorry          4
superman       5
superhumanwith 6
loss           7
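
The two-pass command above prints the file1 line a second time if file2 runs out early, because a failed getline leaves $0 unchanged. A possible variant, sketched here with made-up inputs and a hypothetical variable name (other), checks the return value of getline and pads with an empty string instead:

```shell
#!/bin/sh
dir=$(mktemp -d)
# Made-up inputs; file2 is deliberately one line shorter than file1.
printf 'hi\nwonderful\nsuperhumanwith\n' > "$dir/file1"
printf '1\n2\n' > "$dir/file2"

aligned=$(awk '
  # Pass 1 over file1: record the widest line.
  NR == FNR { if (length($0) > w) w = length($0); next }
  # Pass 2 over file1: pad to w+1 columns, then append the next file2 line.
  {
    if ((getline other < f2) <= 0) other = ""   # file2 exhausted or unreadable
    printf("%-" (w + 1) "s%s\n", $0, other)
  }
' f2="$dir/file2" "$dir/file1" "$dir/file1")

printf '%s\n' "$aligned"
rm -rf "$dir"
```

Reading into a named variable rather than $0 keeps the current file1 line intact, which is what makes the fallback possible.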

To handle any number of input files, the following works. Note: it does not cope with the same filename being repeated, i.e. each filename argument must refer to a different file. It can, however, handle files of different lengths: beyond a file's EOF, spaces are used.

awk 'BEGIN{ for(i=1; i<ARGC; i++) { 
              while( (getline<ARGV[i])>0) { 
                 nl[i]++; if(length>w[i]) w[i]=length; }
              w[i]++; close(ARGV[i])
              if(nl[i]>nr) nr=nl[i]; }
            for(r=1; r<=nr; r++) {
              for(f=1; f<ARGC; f++) {
                if(r<=nl[f]) getline<ARGV[f]; else $0=""  
                printf("%-"w[f]"s",$0); } 
              print "" } }
    ' file1 file2 file3 file4

Here is the output with 4 input files:

hi             1 cat   A 
wonderful      2 hat   B 
amazing        3 mat   C 
sorry          4 moose D 
superman       5       E 
superhumanwith 6       F 
loss           7       G 
                       H
