Linux – SQL-like group by and sum for text files on the command line

awk, bash, command line, linux, sed

I have huge text files with two fields: the first is a string, the second an integer. The files are sorted by the first field. In the output I'd like one line per unique string, together with the sum of the numbers for that string. Some strings appear only once, while others appear multiple times.
E.g., given the sample data below, for the string glehnia I'd like to get 10+22=32 in the result.

Any suggestions on how to do this, either with the GnuWin32 command-line tools or in a Linux shell?

Thanks!

glehnia 10
glehnia 22
glehniae 343
glehnii 923
glei 1171
glei 2283
glei 3466
gleib 914
gleiber 652
gleiberg 495
gleiberg 709

Best Answer

In AWK, you could do something like this:

awk '($1 == last) || (last == "") {sum += $2}                    # same key (or first line): accumulate
     ($1 != last) && (last != "") {print last " " sum; sum = $2} # new key: flush the previous group
                                  {last = $1}                    # remember the current key
     END                          {print last " " sum}' huge_text_file.txt  # flush the final group
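If you don't want to rely on the input being sorted, a sketch using an awk associative array would also work; the key names and the file name `words.txt` below are only for illustration, and the output order is not guaranteed (pipe through `sort` if order matters):

```shell
# Build a small sample file (hypothetical name "words.txt").
printf 'glehnia 10\nglehnia 22\nglehniae 343\nglei 1171\nglei 2283\n' > words.txt

# Accumulate the second field per unique first field, then print all sums.
awk '{ sum[$1] += $2 }
     END { for (key in sum) print key, sum[key] }' words.txt | sort
```

This holds one array entry per unique string in memory, so for very large files with many distinct keys the sorted-input, single-pass approach above uses less memory.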