Bash – Aggregate and group text file in perl or bash

awk bash cut

I have a big text file (about 5 million lines) in this format (4 columns, separated by ;):

string1; string2; string3; userId

The first 3 strings (SHA1s) form a single ID, called appId (so it can be simplified like this: appId; userId). The second column (string2, the second part of appId) may itself consist of several parts separated by commas. The file is sorted.

I would like to have the list of users of each app in front of it, like this:

input file:

app1, user1
app1, user2
app1, user3
app2, user1

output file:

app1: user1, user2, user3
app2: user1

part of "real" input file:

44a934ca4052b34e70f9cb03f3399c6f065becd0;bf038823f9633d25034220b9f10b68dd8c16d867;309;8ead5b3e0af5b948a6b09916bd271f18fe2678aa
44a934ca4052b34e70f9cb03f3399c6f065becd0;bf038823f9633d25034220b9f10b68dd8c16d867;309;a21245497cd0520818f8b14d6e405040f2fa8bc0
5c3eb56d91a77d6ee5217009732ff421e378f298;200000000000000001000000200000,6fd299187a5c347fe7eaab516aca72295faac2ad,e25ba62bbd53a72beb39619f309a06386dd381d035de372c85d70176c339d6f4;16;337556fc485cd094684a72ed01536030bdfae5bb
5c3eb56d91a77d6ee5217009732ff421e378f298;200000000000000001000000200000,6fd299187a5c347fe7eaab516aca72295faac2ad,e25ba62bbd53a72beb39619f309a06386dd381d035de372c85d70176c339d6f4;16;382f3aaa9a0347d3af9b35642d09421f9221ef7d
5c3eb56d91a77d6ee5217009732ff421e378f298;200000000000000001000000200000,6fd299187a5c347fe7eaab516aca72295faac2ad,e25ba62bbd53a72beb39619f309a06386dd381d035de372c85d70176c339d6f4;16;396529e08c6f8a98a327ee28c38baaf5e7846d14

The "real" output file should look like this:

44a934ca4052b34e70f9cb03f3399c6f065becd0;bf038823f9633d25034220b9f10b68dd8c16d867;309:8ead5b3e0af5b948a6b09916bd271f18fe2678aa, a21245497cd0520818f8b14d6e405040f2fa8bc0
5c3eb56d91a77d6ee5217009732ff421e378f298;200000000000000001000000200000,6fd299187a5c347fe7eaab516aca72295faac2ad,e25ba62bbd53a72beb39619f309a06386dd381d035de372c85d70176c339d6f4;16:337556fc485cd094684a72ed01536030bdfae5bb, 382f3aaa9a0347d3af9b35642d09421f9221ef7d, 396529e08c6f8a98a327ee28c38baaf5e7846d14

How can I do this?


Edit: Also, there can be thousands of users per app, so how long can a line be? Is there any limitation for line length?

Best Answer

In Perl

perl -F';' -lane 'push @{$h{join ";",@F[0..2]}},$F[3];
                  END{
                    for(sort keys %h){
                        print "$_: ". join ",",@{$h{$_}};
                    }
                  }' your_file

You should be able to do something similar in awk using associative arrays, but I'm not really that well-versed in awk so I can't contribute actual code.
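For readers who prefer awk, the associative-array idea mentioned above might look something like the following. This is only a minimal sketch, not part of the original answer: the function name group_users is hypothetical, and it assumes the fields are separated by a bare ; with no surrounding spaces, as in the "real" input shown in the question. Unlike the Perl version it does not sort the keys; it prints each appId in the order it was first seen, which matches the input order when the file is already sorted.

```shell
# Hypothetical helper: group user IDs (field 4) by appId (fields 1-3).
# Reads from the files given as arguments, or from stdin if there are none.
group_users() {
    awk -F';' '
    {
        key = $1 ";" $2 ";" $3                # appId = first three fields
        if (key in users) {
            users[key] = users[key] ", " $4   # append to the existing list
        } else {
            users[key] = $4                   # first user for this appId
            order[++n] = key                  # remember first-seen order
        }
    }
    END {
        # Print one line per appId, in the order the appIds first appeared
        for (i = 1; i <= n; i++)
            print order[i] ": " users[order[i]]
    }' "$@"
}
```

It could be called as group_users your_file > grouped_file. Like the Perl one-liner, this keeps every appId and its user list in memory until the end of the input.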

Explanation

Here's an expanded version of the above code that uses as little "magic" as possible:

open($FH,"<","your_file");
while($line=<$FH>){ # For each line in the file (accomplished by -n)
    chomp $line; # Remove the newline at the end (done by -l)
    # The ; is set by -F and storing the split in @F done by -a
    @F = split /;/,$line; # Split the line into fields on ;
    $app_id = join ";",@F[0..2]; # AppID is the first 3 fields
    push @{$h{$app_id}},$F[3]; # The 4th field is added onto the hash
} # The whole file has been read at this point.
foreach $key (sort keys %h){ # Sort the hash by AppID
     print "$key: " . join(",",@{$h{$key}}) . "\n"; # Print the array values
     # The newline ("\n") added at the end is also done by -l
}

Now there is only the push statement left to explain in more detail:

  • push is usually used to add elements to an array variable. For example:

    push @a,$x
    

    appends the contents of the variable $x to the array @a.

  • The loop that reads the file line-by-line is filling in a hash table (%h). The keys to the hash are the AppIDs and the value that corresponds to each key is an array containing all the user IDs associated with that AppID. This is an anonymous array (it has no name); in Perl this is implemented as an array reference (somewhat similar to C pointers). And since the value of %h that corresponds to the AppID $app_id is denoted by $h{$app_id}, wrapping it in the Perl array sigil (@{...}) treats the hash value as an array (de-references the array reference) and pushes the current user ID onto it.

  • An alternative that may feel less "Perlish" to you would be to concatenate the 4th field to the current value:

    while(...) { ... $h{$app_id} = defined($h{$app_id}) ? $h{$app_id} . ",$F[3]" : $F[3] }
    foreach $key (sort keys %h) { print "$key: $h{$key}\n" }
    

    where the . in Perl is the string concatenation operator.

Note that in the explanation code, I have omitted the perl -e '...' wrapper so the syntax highlighting can get to the code and make it more readable.
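Because the question states that the file is already sorted, the same grouping can probably also be done without storing anything in a hash. The sketch below is an alternative technique, not the method used in the Perl code above: it relies on all lines for one appId being adjacent and simply keeps extending the current output line until the appId changes. The function name group_sorted is hypothetical, and the fields are again assumed to be separated by a bare ; with no surrounding spaces.

```shell
# Hypothetical helper: streaming variant that relies on sorted input,
# i.e. all lines belonging to one appId must be adjacent.
group_sorted() {
    awk -F';' '
    {
        key = $1 ";" $2 ";" $3            # appId = first three fields
        if (NR > 1 && key == prev) {
            printf ", %s", $4             # same appId: extend the current line
        } else {
            if (NR > 1) printf "\n"       # close the line of the previous appId
            printf "%s: %s", key, $4      # start a new line for this appId
        }
        prev = key
    }
    END { if (NR > 0) printf "\n" }' "$@"
}
```

It could be called as group_sorted your_file > grouped_file. Only the previous appId is held in memory, so the number of users per app does not affect memory use; if the input is not actually sorted, the same appId will appear on more than one output line.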
