Improving concurrency in MongoDB via sharding

Tags: amazon-ec2, aws, database-design, mongodb

Our current setup:

  • Cluster of ~500 EC2 c1.xlarge worker nodes, each running scientific simulations across 8 threads

  • Results are temporarily stored at runtime in each worker node's local MongoDB instance, which uses ephemeral storage.

  • Once a worker node finishes its simulation jobs, it streams the results to a master node (m2.2xlarge) running a MongoDB instance backed by an EBS RAID 0 volume. Writing directly to the master DB on the fly was causing socket errors.

  • Data across the cluster is not necessarily independent, so the incoming data arrives via upserts and $push operations rather than plain inserts (a simplified sketch of this write pattern follows the list).

  • The worker node is terminated once its data has been received by the master.

  • A typical run will generate ~100GB of data on the master, ~200,000,000 records.
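
For illustration, a single write from a worker looks roughly like the mongo shell call below; the collection name, field names, and key are simplified placeholders for our actual schema:

    // Merge this worker's samples into the shared record for a simulation
    // cell, creating the record if it does not exist yet.
    db.results.update(
        { simId: "run-0042-cell-17" },                     // match the shared record
        { $push: { samples: { t: 3.14, value: 0.87 } } },  // append this worker's data
        { upsert: true }                                   // insert if no match exists
    )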

Our problem is that we encounter severe write latency, which leaves our worker nodes sitting idle while they wait on the master node's write locks and page faults. We have been looking into setting up a sharded environment and would welcome suggestions on how best to configure it for our use case.

Increasing our available memory via sharding seems like a logical step, but keeping a cluster of EC2 instances running can be significantly more costly than a single machine. Ideally, we would use a sharded cluster at runtime and then replicate the results to a persistent single instance for durability.

EDIT: For a given run, our end goal is to have all results in a single collection on the monolithic MongoDB instance.

Best Answer

You are correct: with MongoDB, the way to engineer around a write-contention issue is to shard.

Your environment sounds fairly bursty: you are not continually ingesting data, but rather taking it in discrete chunks. With this in mind, you could go with a collector/distributor model such as this:

     ------>{ workers }<------
     |                       |
 [Shard01]              [Shard02]
     |                       |
     ----->[Persistent]<-----

The workers would upsert/$push their results into the sharded collection/database, and once the job is complete a batch process then uses something like db.copyDatabase() to push the result set to the monolithic (and cheaper to run) single-instance Mongo. Since the copy-database process can push all of its updates in one run, it should experience far fewer write-lock problems.
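
As a rough sketch of the two stages (the shard key, database names, and host are placeholders, and the first two commands are assumed to run through the ingest cluster's mongos):

    // Ingest stage: shard the collection the workers upsert/$push into,
    // splitting write locks and working-set memory across the shards.
    sh.enableSharding("ingest")
    sh.shardCollection("ingest.results", { simId: 1 })

    // Batch stage: once a run completes, pull the whole result database
    // onto the persistent instance in one pass, run from that instance's
    // mongo shell. The hostname below is a placeholder.
    db.copyDatabase("ingest", "run_0042", "mongos.example.internal:27017")

One caveat on the shard key: a monotonically increasing key would funnel every new document to a single shard, so choose something that spreads the simulation IDs evenly across chunks.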
