MongoDB – sharding an collection with data already in it

mongodbsharding

I'm using Mongo (shell version 2.6.12) in a clustered configuration. Right now I have a few collections sharded, and was looking to shard another collection. This collection already has data in it. Once I run the command to shard the collection and give it it's key, will the cluster take the existing data and spread it across the multiple systems in my cluster, or will the existing data stay put on one server and any new data will then be spread across the other systems?

Best Answer

When you shard an existing collection, MongoDB will split the values of the shard key into chunk ranges based on data size and start rebalancing across shards based on the chunk distribution. Rebalancing a large collection can be very resource intensive so you should consider the timing and impact on your production deployment.

In MongoDB 2.6 (which reached end of life in October, 2016) the balancer will only perform one chunk migration at a time. In MongoDB 3.4 or newer the balancer can perform parallel chunk migrations with the restriction that each shard can participate in at most one migration at a time. As at MongoDB 3.6, that means that a sharded cluster with n shards can perform at most n/2 (rounded down) simultaneous chunk migrations.

A faster approach to sharding a large existing collection would be to pre-split chunks in an empty collection based on the current data distribution for your shard key, and to then dump & restore your existing data into this new sharded collection. The pre-split approach minimizes rebalancing activities by creating appropriate chunk ranges so data is distributed on insert.

There have been significant performance improvements since MongoDB 2.6, so I would strongly recommend upgrading to a supported version of MongoDB (ideally MongoDB 3.4 or newer).