MongoDB: Aggregation workaround with a helper collection

Tags: aggregate, mongodb, nosql

Depending on the amount of data I need to process, I regularly run into MongoDB's 16 MB document size limit and the aggregation framework's matching 16 MB limit on results, so I am looking for a simple workaround.

The solution must also work on a single node (and must not require multiple MongoDB instances).

In a few words, it's an aggregation for finding

(a) the number (count) of unique values in one field, grouped by another field, and

(b) the frequency, i.e. a count of how often each unique value appears.

An example should make the task easier to understand: let's say there is a collection with two fields, last name and surname.

(ex a) How many different surnames exist for each last name? The result could be: there are 1000 unique (distinct) surnames for the last name "smith", 500 different surnames for people with the last name "schulz", etc.

=> This could also be solved with a distinct count and/or map-reduce etc., and no limitation is exceeded; however, those routes amount to a slow full collection scan, and the aggregation framework does the task much faster.
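As a minimal sketch (assuming a collection named people with fields lastname and surname; those names are illustrative, not from the post), (ex a) can be written as two chained $group stages:

    // Count distinct surnames per last name (minimal sketch)
    db.people.aggregate([
        // Stage 1: one output document per distinct (lastname, surname) pair
        { $group: { _id: { lastname: "$lastname", surname: "$surname" } } },
        // Stage 2: count those pairs per last name
        { $group: { _id: "$_id.lastname", distinctSurnames: { $sum: 1 } } }
    ])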

(ex b) How popular are surnames for each last name? Result: for the last name "smith" there are 500 different surnames which appear only once, 200 surnames which appear twice, 45 which are listed 3 times in our data, …

=> Here it is very convenient to work with MongoDB's aggregation framework, which handles this task quickly and properly. However, running the task with lots of unique values over a 15 GB collection runs into the 16 MB limitation.
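For (ex b), a sketch with the same assumed names also chains two $group stages: the first counts occurrences of each pair, the second counts how many surnames share each frequency:

    db.people.aggregate([
        // How often does each (lastname, surname) pair occur?
        { $group: { _id: { lastname: "$lastname", surname: "$surname" },
                    n: { $sum: 1 } } },
        // Per last name: how many surnames appear exactly n times?
        { $group: { _id: { lastname: "$_id.lastname", frequency: "$n" },
                    surnames: { $sum: 1 } } }
    ])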

The first workaround was successful, but it is only a quick'n'dirty, non-scalable trick:
Instead of processing all the data and running $group afterwards, I filter ($match) for each $group item. With the example from above, I look up all distinct last names, run one aggregation per last name, and thus reduce the pipeline data before the aggregation starts.

This avoids the 16 MB limit as long as the amount of data for a single $group item (last name) is not too large. => still only a quick'n'dirty workaround, sketched below.
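A minimal sketch of this trick, with the same assumed collection and field names as above:

    // One small aggregation per distinct last name, so no single
    // result gets anywhere near the 16 MB limit.
    db.people.distinct("lastname").forEach(function (ln) {
        var res = db.people.aggregate([
            { $match: { lastname: ln } },                    // shrink the input first
            { $group: { _id: "$surname", n: { $sum: 1 } } }  // then group safely
        ]);
        // iterate/print res for this last name here
    });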

Another idea is a helper collection for the counts.

I could iterate over each input document and write into another collection one document per surname, with one counter field per last name.

E.g. surname=John, count_with_Lastname.smith=10, count_with_Lastname.brown=2, …

Building this helper collection would be very slow, wouldn't it? Incrementing by 1 means one update/write per input document, and I fear this could take forever 🙁
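For illustration, the one-update-per-input-document loop might look like this (the helper collection name surname_counts is an assumption); each input document costs exactly one $inc upsert, which is the write load feared above:

    db.people.find({}, { lastname: 1, surname: 1 }).forEach(function (doc) {
        var inc = {};
        inc["count_with_Lastname." + doc.lastname] = 1;  // one counter field per last name
        db.surname_counts.update(
            { _id: doc.surname },  // one helper document per surname
            { $inc: inc },         // e.g. { "count_with_Lastname.smith": 1 }
            { upsert: true }       // create the helper document on first sight
        );
    });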

Is there a best practice, or are there tips, for solving this task (without multiple nodes and without Hadoop or alternatives)?

Please let me know.

Best Answer

Once you go beyond the limitations you mention (at least until $out is supported), your best option is Map Reduce. If you are going to run multiple Map Reduce jobs, I would recommend version 2.4 at least (it uses the V8 engine and is multi-threaded). For an example of how to do a unique count with Map Reduce, take a look here:

http://cookbook.mongodb.org/patterns/unique_items_map_reduce/
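As a rough sketch of that unique-count idea (not the cookbook's exact code, and using the assumed people collection from the question's example), two Map Reduce passes would do it:

    // Pass 1: collapse the data to unique (lastname, surname) pairs
    db.people.mapReduce(
        function () { emit({ lastname: this.lastname, surname: this.surname }, 1); },
        function (key, values) { return 1; },  // duplicates all collapse to 1
        { out: { replace: "unique_pairs" } }
    );
    // Pass 2: count the unique pairs per last name
    db.unique_pairs.mapReduce(
        function () { emit(this._id.lastname, 1); },
        function (key, values) { return Array.sum(values); },
        { out: { replace: "surnames_per_lastname" } }
    );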