I have a big text file (about 5 million lines) in this format (4 columns, separated by `;`):
string1; string2; string3; userId
The first 3 strings (SHA1s) form a single ID, called appId (so the format can be simplified to: appId; userId). The second column (string2, the second part of appId) may itself consist of several parts separated by commas (`,`). The file is sorted.
I would like to have the list of users of each app in front of it, like this:
input file:
app1, user1
app1, user2
app1, user3
app2, user1
output file:
app1: user1, user2, user3
app2: user1
part of "real" input file:
44a934ca4052b34e70f9cb03f3399c6f065becd0;bf038823f9633d25034220b9f10b68dd8c16d867;309;8ead5b3e0af5b948a6b09916bd271f18fe2678aa
44a934ca4052b34e70f9cb03f3399c6f065becd0;bf038823f9633d25034220b9f10b68dd8c16d867;309;a21245497cd0520818f8b14d6e405040f2fa8bc0
5c3eb56d91a77d6ee5217009732ff421e378f298;200000000000000001000000200000,6fd299187a5c347fe7eaab516aca72295faac2ad,e25ba62bbd53a72beb39619f309a06386dd381d035de372c85d70176c339d6f4;16;337556fc485cd094684a72ed01536030bdfae5bb
5c3eb56d91a77d6ee5217009732ff421e378f298;200000000000000001000000200000,6fd299187a5c347fe7eaab516aca72295faac2ad,e25ba62bbd53a72beb39619f309a06386dd381d035de372c85d70176c339d6f4;16;382f3aaa9a0347d3af9b35642d09421f9221ef7d
5c3eb56d91a77d6ee5217009732ff421e378f298;200000000000000001000000200000,6fd299187a5c347fe7eaab516aca72295faac2ad,e25ba62bbd53a72beb39619f309a06386dd381d035de372c85d70176c339d6f4;16;396529e08c6f8a98a327ee28c38baaf5e7846d14
The "real" output file should look like this:
44a934ca4052b34e70f9cb03f3399c6f065becd0;bf038823f9633d25034220b9f10b68dd8c16d867;309:8ead5b3e0af5b948a6b09916bd271f18fe2678aa, a21245497cd0520818f8b14d6e405040f2fa8bc0
5c3eb56d91a77d6ee5217009732ff421e378f298;200000000000000001000000200000,6fd299187a5c347fe7eaab516aca72295faac2ad,e25ba62bbd53a72beb39619f309a06386dd381d035de372c85d70176c339d6f4;16:337556fc485cd094684a72ed01536030bdfae5bb, 382f3aaa9a0347d3af9b35642d09421f9221ef7d, 396529e08c6f8a98a327ee28c38baaf5e7846d14
How can I do this?
Edit: Also, there can be thousands of users per app, so how long can a line be? Is there any limitation for line length?
Best Answer
In Perl
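A minimal sketch of such a Perl solution, not necessarily the exact code this answer was written around. It assumes the fields are separated by a bare `;` with no surrounding spaces, as in the "real" sample from the question. The sub name `group_users` and the in-memory sample data are illustrative choices, and the keys are sorted only to make the output order predictable.

```perl
use strict;
use warnings;

# Group user IDs by app ID. Each input line is "s1;s2;s3;userId";
# the first three fields together form the app ID.
sub group_users {
    my ($fh) = @_;
    my %h;    # app ID => reference to an array of user IDs
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;    # skip empty lines
        my @f = split /;/, $line;
        push @{ $h{ join ';', @f[0 .. 2] } }, $f[3];
    }
    # One output line per app: "appId:user1, user2, ..."
    return map { $_ . ':' . join(', ', @{ $h{$_} }) } sort keys %h;
}

# Small in-memory sample (made-up IDs) standing in for the real file.
my $sample = "a;b;1;u1\na;b;1;u2\nc;d,e;2;u1\n";
open my $in, '<', \$sample or die "cannot open sample: $!";
print "$_\n" for group_users($in);
close $in;
```

To run this against the real data, open the file by name instead of the `\$sample` reference and redirect the output to a new file.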
You should be able to do something similar in `awk` using associative arrays, but I'm not really that well-versed in `awk`, so I can't contribute actual code.
Explanation
Here's an expanded version of the above code that uses as little "magic" as possible:
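A sketch of what such an expanded version might look like, with every step spelled out. The names `%h` and `$app_id` follow the explanation in this answer; `$user_id`, `@fields`, `@output` and the sample data are assumptions made for illustration, as is the choice to sort the keys before printing.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the real file. To read a real file,
# pass its name to open() instead of the reference \$input.
my $input = <<'END_SAMPLE';
app1a;app1b;1;user1
app1a;app1b;1;user2
app1a;app1b;1;user3
app2a;x,y,z;2;user1
END_SAMPLE

open my $fh, '<', \$input or die "cannot open input: $!";

my %h;    # app ID => reference to an array of user IDs

while (my $line = <$fh>) {
    chomp $line;                                  # remove the trailing newline
    my @fields  = split /;/, $line;               # the four columns
    my $app_id  = join ';', @fields[0, 1, 2];     # first three columns = app ID
    my $user_id = $fields[3];                     # fourth column = user ID
    push @{ $h{$app_id} }, $user_id;              # append user to this app's list
}
close $fh;

# Build one output line per app ID, then print them.
my @output;
foreach my $app_id (sort keys %h) {
    my @users = @{ $h{$app_id} };
    push @output, $app_id . ':' . join(', ', @users);
}
print "$_\n" foreach @output;
```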
Now there is only the `push` statement left to explain in more detail:
`push` is usually used to add elements to an array variable. For example, `push @a, $x;` appends the contents of the variable `$x` to the array `@a`.
The loop that reads the file line by line fills in a hash table (`%h`). The keys of the hash are the AppIDs, and the value that corresponds to each key is an array containing all the user IDs associated with that AppID. This is an anonymous array (it has no name); in Perl this is implemented as an array reference (somewhat similar to C pointers). Since the value of `%h` that corresponds to the AppID `$app_id` is denoted by `$h{$app_id}`, tacking on the Perl array sigil (`@`) treats the hash value as an array (it de-references the array reference), and `push` appends the current user ID to it.
An alternative that may feel less "Perlish" to you would be to concatenate the 4th field to the current value:
where the `.` in Perl is the string concatenation operator.
Note that in the explanation code, I have omitted the `perl -e '...'` wrapper so the syntax highlighting can get to the code and make it more readable.
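To see the array-reference mechanics described in this answer in isolation, here is a small sketch. The key `app1` and the user IDs are hypothetical. It relies on the fact that Perl creates the anonymous array automatically the first time it is used as the target of `push` (autovivification).

```perl
use strict;
use warnings;

my %h;
my $app_id = 'app1';    # hypothetical key

# $h{$app_id} does not exist yet; the first push creates the anonymous array.
push @{ $h{$app_id} }, 'user1';
push @{ $h{$app_id} }, 'user2';

print ref($h{$app_id}), "\n";               # type of the stored value
print scalar(@{ $h{$app_id} }), "\n";       # number of stored user IDs
print join(', ', @{ $h{$app_id} }), "\n";   # the user IDs themselves
```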