MongoDB – Should I deploy MongoDB sharding for 50 collections?

mongodb

I have a 10-node cluster that runs about 50 concurrent jobs. Each job needs to read/write a separate collection, so I have roughly 50 collections. Each collection has about 20 million records. Most of the time, jobs only need to do sequential reads and writes.

For simplicity, I could deploy a single instance of MongoDB with no replication and no sharding, and have 50 separate collections. But then the single node where MongoDB is running becomes a hotspot, and the remaining 9 nodes can't share the read/write load.

So I would like to make use of those resources and balance the load. But I guess a 20-million-record collection is not worth sharding, especially since I only need sequential reads and writes? I thought about merging the 50 collections into one big collection, but I stumbled on the document size limit, which is 16 MB.

Any suggestions? Thanks,

Best Answer

You could use a sharded cluster to distribute the collections more evenly across your available hosts. In general I would recommend not going beyond 5 shards on your 10 hosts (each shard being a replica set with at least 2 data-bearing nodes for redundancy, plus an arbiter to break ties).

Once you have your 5 shards (or however many you end up with), you can then "pin" the collections to a particular shard with tagging. I could explain that process in detail, but as usual, Kristina Chodorow has beaten me to it:

http://www.kchodorow.com/blog/2012/07/25/controlling-collection-distribution/
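For a rough idea of what that pinning involves, here is a minimal Python sketch that only builds the admin command documents and does not talk to a cluster. It assumes the zone commands (`addShardToZone`, `updateZoneKeyRange`) from newer MongoDB releases rather than the tag helpers used in the linked post. The helper `build_pinning_plan`, the database name `jobs`, the collection names, the shard names, and the `_id` shard key are all hypothetical.

```python
# Placeholders only: swap for bson.MinKey() / bson.MaxKey() before sending.
MIN_KEY = "<MinKey>"
MAX_KEY = "<MaxKey>"


def build_pinning_plan(db, collections, shards, key="_id"):
    """Build admin commands that pin each collection to one shard via a zone.

    Collections are spread over the shards in order, one zone per shard.
    """
    plan = [{"enableSharding": db}]
    # One zone per shard, named after the shard (naming is an assumption).
    for shard in shards:
        plan.append({"addShardToZone": shard, "zone": f"zone_{shard}"})
    for i, coll in enumerate(collections):
        ns = f"{db}.{coll}"
        zone = f"zone_{shards[i % len(shards)]}"
        plan.append({"shardCollection": ns, "key": {key: 1}})
        # Covering the whole key range keeps every chunk in a single zone.
        plan.append({
            "updateZoneKeyRange": ns,
            "min": {key: MIN_KEY},
            "max": {key: MAX_KEY},
            "zone": zone,
        })
    return plan


shards = [f"shard{i:04d}" for i in range(5)]
collections = [f"job_{i:02d}" for i in range(50)]
plan = build_pinning_plan("jobs", collections, shards)
```

With PyMongo, each document in `plan` could then be sent through a `mongos` with `client.admin.command(cmd)`, once the placeholders are replaced by real `MinKey`/`MaxKey` values.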

A single database will work if you prefer, but I would suggest using more than one in order to benefit as much as possible. Locking is at the database level, so concurrent writes will still compete with each other when several collections on a shard live in the same database.

In fact, there is a simpler solution to this than tagging (though tags are the most scalable solution). If you create separate databases for all of your collections (or at least a subset), you can have each database reside on a different shard and not shard the collections at all.

To explain: without enabling sharded collections, all data for a given database will reside on the "primary" shard for that database (primary shards are designated in a round-robin fashion). Hence, if you create 5 databases on a 5-shard cluster, for example, each will have a different shard as its primary, and you will have achieved a crude (but effective) distribution of load by using separate databases. I added more detail regarding this in a previous answer.
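To make the arithmetic concrete, here is a toy Python model of the round-robin placement described above. It does not query MongoDB; the function `assign_primaries` and the database and shard names are hypothetical, and a real cluster may place databases differently.

```python
from collections import Counter
from itertools import cycle


def assign_primaries(databases, shards):
    """Toy model: hand out primary shards to new databases in round-robin order."""
    return dict(zip(databases, cycle(shards)))


shards = [f"shard{i}" for i in range(5)]
databases = [f"job_{i:02d}" for i in range(50)]  # one database per job

placement = assign_primaries(databases, shards)
load = Counter(placement.values())  # number of databases per shard
```

Under this model, 50 single-collection databases on 5 shards come out at 10 per shard. If the real placement turns out uneven, the `movePrimary` admin command can relocate a database's primary shard by hand.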