Linux – How to combine the information of each pair of rows in one row

linux, text-processing

I have data like this (the real data has over 50,000 digits and 8000 rows):

input:

1 11122
1 21121
2 22221
2 11122
3 21121
3 11122

I want to put the value of each second row beside the value of the first row with the same name. Also, there should be two spaces as the delimiter within each pair of values, and one tab as the delimiter between different pairs of values. The output should look like:

output:

1   1  2    1  1    1  1    2  2    2  1
2   2  1    2  1    2  1    2  2    1  2
3   2  1    1  1    1  1    2  2    1  2

Any suggestions?

Best Answer

I'd use perl, and run it as a one-liner like this:

perl -wne 'sub parseline { ($id,$v) = split; return split //,$v };
    @a = parseline();
    print "$id\t";
    $_ = <>;
    @b = parseline();
    for ($i=0; $i<@a; $i++) {
      print "$a[$i]  $b[$i]\t"
    };
    print "\n"' < input > output

Explanation:

  • perl -wne runs the rest of the command for each line of input
  • sub parseline { .... } parses a line: it sets the first number on the line as $id, and returns the rest as an array of characters.
  • @a = parseline() stores the first line's characters in the array @a
  • next, we print $id, followed by a TAB (\t)
  • $_ = <>; @b = parseline(); reads the next (even) line and puts its data in the array @b
  • for ($i=0; $i<@a; $i++) { print "$a[$i]  $b[$i]\t" } for each element of the array @a, prints that element, two spaces, the corresponding element from array @b, and then a tab
  • print "\n" prints a newline at the end
  • due to the -n switch passed to perl at the start, the whole process restarts with line 3, then 5, then 7, etc.
  • < input > output indicates which file we read our input from, and which file we write our output to.

Note: the code prints an extra tab at the end of each line. Removing it is left as an exercise for the reader, to discourage crowdsourced homework assignments and keep the code a little simpler. The code also assumes that the lines to pair always come in twos, one right after the other (as in the example).

As it processes the input file line by line, it scales linearly to many thousands of lines...
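As an aside, the same two-lines-at-a-time pairing can be sketched in POSIX awk (again an alternative, not the approach from the answer above); it also happens to avoid the trailing tab:

```shell
# Odd lines: remember the id and the digit string; even lines: emit the
# id, then each pair "a  b" separated by tabs.
awk 'NR % 2 { id = $1; v1 = $2; next }
     { line = id
       for (i = 1; i <= length(v1); i++)
         line = line "\t" substr(v1, i, 1) "  " substr($2, i, 1)
       print line }' input > output
```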
