MongoDB – Aggregation vs Cursor for Reshaping MongoDB Documents

mongodb

I'm looking to reshape the Documents in one of my collections, and have found two ways to do it, but need guidance. For simplicity, say I have a collection, "myColl", and I need to reshape Documents that look like this:

{
   x:"foo",
   y:"bar"
}

To:

{
   nest: {
       x: "foo",
       y: "bar"
   }
}

This can be accomplished by using the aggregation framework to reshape the documents, and then rewrite the entire collection. When run against a test collection of about 150K records, the following takes roughly 5 seconds:

db.myColl.aggregate([{$project: {_id: "$_id", nest: {x: "$x", y: "$y"}}}, {$out: "myColl"}]);

If I try to do this using a cursor, it takes about a 1.5 minutes:

db.myColl.find().snapshot().forEach(
        function(elem) {
            db.myColl.update(
                {_id: elem._id},
                {$set: {nest: {x: elem.x, y: elem.y}}}
            );
        }
);

I'm leaning towards the aggregation approach for performance reasons; however, someone mentioned here that it creates a "new" collection, with somewhat of a negative connotation, but it's not entirely clear why. Are there causes for concern that I should be aware of other than the type safety mentioned in that comment?

Also, if the cursor approach is better, then how might I speed up the execution? Setting the "w" param of WriteConcern to 0 doesn't do anything in my test because everything is hosted on the same box, so skipping the acknowledgement doesn't save me any time, and it's orthogonal to the fact that aggregation executes orders of magnitude faster.

Thanks for the input!

Best Answer

Aggregation vs Cursor

Let's start with aggregation. As per the MongoDB documentation, aggregation operations process data records and return computed results. Aggregation operations group values from multiple documents together and can perform a variety of operations on the grouped data to return a single result. The aggregation pipeline can use indexes to improve its performance during some of its stages. In addition, the aggregation pipeline has an internal optimization phase.

The most basic pipeline stages provide filters that operate like queries and document transformations that modify the form of the output document.

The pipeline provides efficient data aggregation using native operations within MongoDB, and is the preferred method for data aggregation in MongoDB.

MongoDB provides three ways to perform aggregation: the aggregation pipeline, the map-reduce function, and single purpose aggregation methods. Below is an example of each.

First, let's create an orders collection with 4 documents:

> db.orders.insertMany([
... {cust_id: "A123",
... amount: 500,
... status: "A"
... },
... {cust_id: "A123",
... amount: 250,
... status: "A"
... },
... {cust_id: "B212",
... amount: 200,
... status: "A"
... },
... {cust_id: "A123",
... amount: 300,
... status: "D"
... }
... ]
... )
{
        "acknowledged" : true,
        "insertedIds" : [
                ObjectId("5a44c1479adf6e5fc5cea525"),
                ObjectId("5a44c1479adf6e5fc5cea526"),
                ObjectId("5a44c1479adf6e5fc5cea527"),
                ObjectId("5a44c1479adf6e5fc5cea528")
        ]
}

To verify the inserted documents:

> db.orders.find().pretty()
{
        "_id" : ObjectId("5a44c1479adf6e5fc5cea525"),
        "cust_id" : "A123",
        "amount" : 500,
        "status" : "A"
}
{
        "_id" : ObjectId("5a44c1479adf6e5fc5cea526"),
        "cust_id" : "A123",
        "amount" : 250,
        "status" : "A"
}
{
        "_id" : ObjectId("5a44c1479adf6e5fc5cea527"),
        "cust_id" : "B212",
        "amount" : 200,
        "status" : "A"
}
{
        "_id" : ObjectId("5a44c1479adf6e5fc5cea528"),
        "cust_id" : "A123",
        "amount" : 300,
        "status" : "D"
}

Aggregation Pipeline

MongoDB’s aggregation framework is modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into an aggregated result.

> db.orders.aggregate([{$match: {status: "A"}},
... {$group: {_id: "$cust_id",total:{$sum: "$amount"}}}])
{ "_id" : "B212", "total" : 200 }
{ "_id" : "A123", "total" : 750 }
> 
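Applied to the reshaping problem in the question, the same idea ($project feeding $out) on this example collection might look like the following sketch; the target collection name orders_reshaped is just an illustrative choice, and writing back to the source collection, as in the question, would replace it once the pipeline completes:

> db.orders.aggregate([
...     {$project: {nest: {cust_id: "$cust_id", amount: "$amount", status: "$status"}}},
...     {$out: "orders_reshaped"}
... ])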

Map-Reduce

MongoDB also provides map-reduce operations to perform aggregation. In general, map-reduce operations have two phases: a map stage that processes each document and emits one or more objects for each input document, and a reduce phase that combines the output of the map operation. Optionally, map-reduce can have a finalize stage to make final modifications to the result. Like other aggregation operations, map-reduce can specify a query condition to select the input documents as well as sort and limit the results.

> db.orders.mapReduce(
... function() {emit (this.cust_id,this.amount);},
... function(key,values){return Array.sum(values)},
... {
... query:{status: "A"},
... out: "order_total"
... }
... )
{
        "result" : "order_total",
        "timeMillis" : 1178,
        "counts" : {
                "input" : 3,
                "emit" : 3,
                "reduce" : 1,
                "output" : 2
        },
        "ok" : 1
}
>

Note: Starting in MongoDB 2.4, certain mongo shell functions and properties are inaccessible in map-reduce operations. MongoDB 2.4 also provides support for multiple JavaScript operations to run at the same time. Before MongoDB 2.4, JavaScript code executed in a single thread, raising concurrency issues for map-reduce.

Single Purpose Aggregation Operations

MongoDB also provides db.collection.count() and db.collection.distinct().

All of these operations aggregate documents from a single collection. While these operations provide simple access to common aggregation processes, they lack the flexibility and capabilities of the aggregation pipeline and map-reduce.

> db.orders.distinct("cust_id")
[ "A123", "B212" ]

Cursor

As per the MongoDB documentation on iterating a cursor in the mongo shell, the db.collection.find() method returns a cursor. To access the documents, you need to iterate the cursor. However, in the mongo shell, if the returned cursor is not assigned to a variable using the var keyword, then the cursor is automatically iterated up to 20 times to print up to the first 20 documents in the results.

The following examples describe ways to manually iterate the cursor to access the documents or to use the iterator index.

Manually Iterate the Cursor

var myCursor = db.orders.find( { cust_id: "A123" } );

myCursor
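You can also step through the cursor explicitly with hasNext() and next(), for example (a small sketch on the same collection):

var myCursor = db.orders.find( { cust_id: "A123" } );

while (myCursor.hasNext()) {
   printjson(myCursor.next());   // print one document per iteration
}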

You can use the cursor method forEach() to iterate the cursor and access the documents, as in the following example:

var myCursor = db.orders.find( { cust_id: "A123" } );

myCursor.forEach(printjson);

Note: You can use DBQuery.shellBatchSize to change the number of documents printed per iteration from the default value of 20.
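For example, to print 10 documents per batch instead of 20 (sketch):

DBQuery.shellBatchSize = 10;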

Iterator Index

In the mongo shell, you can use the toArray() method to iterate the cursor and return the documents in an array, as in the following:

var myCursor = db.orders.find( { cust_id: "A123" } );
var documentArray = myCursor.toArray();
var myDocument = documentArray[2];

The toArray() method loads all documents returned by the cursor into RAM and exhausts the cursor.

Cursor Behaviors

Closure of Inactive Cursors: by default, the server automatically closes the cursor after 10 minutes of inactivity, or if the client has exhausted the cursor. To override this behavior in the mongo shell, you can use the cursor.noCursorTimeout() method:

var myCursor = db.orders.find().noCursorTimeout();

After setting the noCursorTimeout option, you must either close the cursor manually with cursor.close() or exhaust the cursor's results.
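For example, once you are done with the cursor (sketch):

myCursor.close();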

To view cursor information from the MongoDB server

The db.serverStatus() method returns a document that includes a metrics field.

db.serverStatus().metrics.cursor

The result is the following document:

{
   "timedOut" : <number>,
   "open" : {
      "noTimeout" : <number>,
      "pinned" : <number>,
      "total" : <number>
   }
}

Finally: aggregation operations group values from multiple documents together and can perform a variety of operations on the grouped data to return a single result. The pipeline provides efficient data aggregation using native operations within MongoDB, and is the preferred method for data aggregation in MongoDB.

With a cursor, on the other hand, if the returned cursor is not assigned to a variable using the var keyword in the mongo shell, it is automatically iterated up to 20 times to print up to the first 20 documents in the results, and the toArray() method loads all documents returned by the cursor into RAM, exhausting the cursor.
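If you do stay with the cursor approach from the question, one way to cut down the per-document round trips is to batch the updates with the Bulk API (available from MongoDB 2.6 onward). A rough sketch, assuming the same myColl collection as in the question:

var bulk = db.myColl.initializeUnorderedBulkOp();
var count = 0;

db.myColl.find().forEach(function(elem) {
    // queue one update per document instead of sending it immediately
    bulk.find({_id: elem._id}).updateOne({$set: {nest: {x: elem.x, y: elem.y}}});
    count++;

    // flush in batches of 1000 and start a new bulk operation
    if (count % 1000 === 0) {
        bulk.execute();
        bulk = db.myColl.initializeUnorderedBulkOp();
    }
});

// flush any remaining queued updates
if (count % 1000 !== 0) {
    bulk.execute();
}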