Bash – merge duplicate rows in columns

awkbashperltext processing

Given a file like this:

x y y z x
x x y z z y
x x x z y
y z z y x x x
x x x x x

I would like the output to be:

x y+ z x
x+ y z+ y
x+ z y
y z+ y x+
x+

Is that possible to do with awk or perl in a oneliner? i.e. is it possible to find any number of similar values in rows and merge them?

Best Answer

This perl version can also deal with arbitrary field lengths, not only those of a single character:

$ perl -lpae 'for $i (@F){s/($i\s*){2,}/$i+ /g}' file 
x y+ z x
x+ y z+ y
x+ z y
y z+ y x+ 
x+

On a more complex file:

$ cat file
foo foo foo bar foo
bar foo bar bar foo
foo foo x x x bar
$ perl -lpae 'for $i (@F){s/($i\s*){2,}/$i+ /g}' file 
foo+ bar foo
bar foo bar+ foo
foo+ x+ bar

Explanation

The -l trims newlines from each input line, the -a splits input fields on whitespace into the array @F and the -p prints each input line after applying the script given by -e.

The script itself iterates over each input field (the @F array), saving each as $i. The substitution looks for 2 or more consecutive $i followed by 0 or more spaces and replaces them with $i+.

Related Solutions

Text Processing AWK Perl – How to Average Rows with Same First Column

Using awk :

The input file

Awk in a shell :

$ awk '
    NR>1{
        arr[$1]   += $2
        count[$1] += 1
    }
    END{
        for (a in arr) {
            print "id avg " a " = " arr[a] / count[a]
        }
    }
' FILE

Or with Perl in a shell :

$ perl -lane '
    END {
        foreach my $key (keys(%hash)) {
            print "id avg $key = " . $hash{$key} / $count{$key};
        }
    }
    if ($. > 1) {
        $hash{$F[0]}  += $F[1];
        $count{$F[0]} += 1;
    }
' FILE

Output is :

id avg 601 = 24
id avg 510 = 64.4

And last for the joke, a Perl dark-obfuscated one-liner =)

perl -lane'END{for(keys(%h)){print"$_:".$h{$_}/$c{$_}}}($.>1)&&do{$h{$F[0]}+=$F[1];$c{$F[0]}++}' FILE

UNIX paste columns and insert zeros for all missing values

If column-order is important, i.e. numbers from the same file should be kept in the same column, you need to add padding while reading the different files. Here is one way that works with GNU awk:

merge.awk

# Set k to be a shorthand for the key
{ k = $1 SUBSEP $2 }

# First element with this key, add zeros to align it with other rows
!(k in h) {
  for(i=1; i<=ARGIND-1; i++)
    h[k] = h[k] OFS 0 
}

# Remember the data element
{ h[k] = h[k] OFS $3 }

# Before moving to the next file, ensure that all rows are aligned
ENDFILE {
  for(k in h) {
    if(split(h[k], a) < ARGIND)
      h[k] = h[k] OFS 0
  }
}

# Print out the collected data
END {
  for(k in h) {
    split(k, a, SUBSEP)
    print a[1], a[2], h[k]
  }
}

Here are some test files: f1, f2, f3 and f4:

$ tail -n+1 f[1-4]
==> f1 <==
xyz desc1 21
uvw desc2 22
pqr desc3 23

==> f2 <==
xyz desc1 56
uvw desc2 57

==> f3 <==
xyz desc1 87
uvw desc2 88

==> f4 <==
xyz desc1 11
uvw desc2 12
pqr desc3 13
stw desc1 14
arg desc2 15

Test 1

awk -f merge.awk f[1-4] | column -t

Output:

pqr  desc3  23  0   0   13
uvw  desc2  22  57  88  12
stw  desc1  0   0   0   14
arg  desc2  0   0   0   15
xyz  desc1  21  56  87  11

Test 2

awk -f merge.awk f2 f3 f4 f1 | column -t