MongoDB – Running aggregation logic on multiple shards of MongoDB

mongodb

Suppose we have 5 shards in MongoDB holding a collection of data, and I have to write aggregation logic that works on each of the 5 shards in my cluster and collects data from them. Does this have to be handled on the application development side, for example by connecting to each shard separately by its shard key and fetching the data? Or, once I write the aggregation logic and deploy my jar on the cluster, will MongoDB itself read from the shards and run the aggregation logic on their data?

In Cassandra MapReduce, for comparison, a job tracker sends the job to the appropriate nodes.

Best Answer

You don't ever need to connect to specific shards in a MongoDB database. Instead, you connect to a mongos instance that handles the routing for you.

In your case, you would connect to the mongos instance as usual, either by typing mongo in the terminal or through a language-specific client. You send your aggregation operation to the mongos instance, which distributes it to each of the shards and combines the results at the end.

Sharding is, in many ways, transparent to the user (though certainly not transparent to the database architect): many queries that run on a non-sharded mongod instance will run in the same manner on a mongos instance.
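
To make the "distribute, then combine" idea concrete, here is a rough sketch in plain JavaScript. It is only a conceptual simulation and not how mongos is actually implemented: in-memory arrays stand in for the shards, and the data, field names (region, amount) and function names (partialGroup, mergePartials) are all made up for illustration. On a real cluster you would not write any of this yourself; you would issue a single aggregate call against the collection through mongos.

```javascript
// Conceptual simulation only: in-memory arrays stand in for shards.
// Hypothetical data: each "shard" holds a slice of an orders collection.
const shards = [
  [{ region: "eu", amount: 10 }, { region: "us", amount: 5 }],
  [{ region: "eu", amount: 7 }],
  [{ region: "us", amount: 3 }, { region: "asia", amount: 4 }],
];

// Step run on every shard: group the local documents by region and sum amounts.
function partialGroup(docs) {
  const totals = new Map();
  for (const { region, amount } of docs) {
    totals.set(region, (totals.get(region) || 0) + amount);
  }
  return totals;
}

// Step run once by the router: merge the per-shard partial totals.
function mergePartials(partials) {
  const merged = new Map();
  for (const partial of partials) {
    for (const [region, total] of partial) {
      merged.set(region, (merged.get(region) || 0) + total);
    }
  }
  return Object.fromEntries(merged);
}

const result = mergePartials(shards.map(partialGroup));
console.log(result); // { eu: 17, us: 8, asia: 4 }
```

The application only ever sees the merged result, which is why the client code looks the same whether the collection lives on one server or on five shards.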

Related Solutions

MongoDB – How to remove a corrupted shard in MongoDB

You can use the commands below to remove the shard. I haven't tried this myself, but it looks straightforward.

use admin
db.runCommand( { removeShard: "mongodb0" } )

--Response
{
    "msg" : "draining started successfully",
    "state" : "started",
    "shard" : "mongodb0",
    "ok" : 1
}

link: http://docs.mongodb.org/manual/tutorial/remove-shards-from-cluster
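
The response above only says that draining has started. As far as I understand the linked tutorial, you re-run the same removeShard command to check progress, and the state moves from "started" through "ongoing" to "completed". A small sketch of how a script might interpret each response; the helper name nextRemoveShardStep and its return values are made up for illustration, and the sample document is the response shown above.

```javascript
// Hypothetical helper: decides what to do after one removeShard response.
function nextRemoveShardStep(response) {
  if (response.ok !== 1) {
    return "error"; // the command itself failed
  }
  switch (response.state) {
    case "started":
    case "ongoing":
      return "retry"; // still draining: run removeShard again later
    case "completed":
      return "done"; // the shard has been removed from the cluster
    default:
      return "error"; // a state this sketch does not know about
  }
}

// The response document shown above.
const started = {
  msg: "draining started successfully",
  state: "started",
  shard: "mongodb0",
  ok: 1,
};
console.log(nextRemoveShardStep(started)); // prints "retry"
```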

MongoDB Sharding – Limiting Number of Documents to Migrate

For your 7 collections on the primary shard:

Connect as administrator to the primary of the replica set (if your shards are replica sets) and create your collection there by inserting 1 document or creating an index. When you create a collection via the mongos (the shard client), the collection is started on a random shard; if you create it on the shard itself first, you know it is on the one you want.

shard1$ mongo --port 27018 localhost/mydb --eval 'db.mycol.insert({firstdocument:"hello"})'

For the 25 collections:

Standard sharding, as you mention. Be patient with the distribution: chunks only start to be spread across shards after a certain amount of data has accumulated, which can be around 100k documents if the documents are small.

For the big collection:

Have a look at shard tags. You tag your collection on shard1 with, say, {status:'in_use'}, while shard2 is tagged {status:'freeBeer'}. The shard is then chosen depending on the status in the document.

For the disk issue, I think you will have to write a script: raise a warning when 60% of a disk is used, and another at 80% so that you take action. At that point you will need to upgrade your disk or redistribute the chunks. The 2TB collection could certainly free up space via the tags, by sending more data to shard2. Or add a shard3, which is the actual purpose of sharding: growing horizontally.

Choose your shard keys well, and practice configuration and setup on a small server: moving chunks, adding tags, and so on.

Have a look at Adam Comerford's AllChunkInfo script for your scripts and testing: https://github.com/comerford/mongodb-scripts
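
The two disk thresholds mentioned above could be sketched roughly like this. The function name diskAlertLevel and its return values are made up for illustration, and how you obtain the used and total bytes (monitoring agent, OS tools, ...) is left open.

```javascript
// Hypothetical helper: maps disk usage to the two alert levels suggested above.
function diskAlertLevel(usedBytes, totalBytes) {
  if (totalBytes <= 0) {
    throw new RangeError("totalBytes must be positive");
  }
  const usedFraction = usedBytes / totalBytes;
  if (usedFraction >= 0.8) return "act";  // 80%: upgrade the disk or move chunks
  if (usedFraction >= 0.6) return "warn"; // 60%: first warning
  return "ok";
}

console.log(diskAlertLevel(650, 1000)); // prints "warn"
```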
