Building on what you found out yourself, I can provide a more detailed explanation.
How chunk migration works (simplified)
Here is how chunk migration works (simplified for brevity):
- When a chunk exceeds the configured chunk size (64 MB by default), the mongos that caused the size increase will split the chunk.
- The chunk is split on the shard where the original chunk resides. Since the config servers do not form a replica set, the metadata has to be updated on all three config servers.
- If the difference in the number of chunks between the shards reaches the migration threshold, the cluster balancer will initiate the migration process (the sketch after this list shows how to check the chunk counts per shard).
- The balancer will try to acquire the balancing lock. Note that because of this there can be only one chunk migration at any given point in time. (This will become important in the explanation of what happened below.)
- In order to acquire the balancing lock, the balancer needs to update the config on all three config servers.
- Once the balancing lock is acquired, the data is copied over to the shard with the fewest chunks.
- After the data has been successfully copied to the destination, the balancer updates the key-range-to-shard mapping on all three config servers.
- As a last step, the balancer notifies the source shard to delete the migrated chunk.
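To see whether a collection is unevenly distributed, you can count the chunks per shard in the config database from a mongos shell. A minimal sketch; the namespace "mydb.mycoll" is a hypothetical placeholder:

```js
// From a mongos shell: count how many chunks each shard holds for one
// collection ("mydb.mycoll" is a placeholder namespace).
db.getSiblingDB("config").chunks.aggregate([
    { $match: { ns: "mydb.mycoll" } },
    { $group: { _id: "$shard", chunkCount: { $sum: 1 } } }
])
```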
What happened in your case
Your data was increasing to a point where chunks had to be split. At this point, all three of your config servers were available and the metadata was updated accordingly. Chunk splitting is a relatively cheap operation compared to chunk migration and may happen often, so under load you usually have a lot more chunk splits than migrations. As said before, only one chunk migration can happen at any given point in time.
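If you want to verify the chunk size your cluster is actually using, it is stored in the config database. A quick check from a mongos shell:

```js
// Read the configured chunk size in MB; if no document exists,
// the default of 64 MB is in effect.
db.getSiblingDB("config").settings.find({ _id: "chunksize" })
```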
For whatever reason, one or more of your config servers became unreachable after the chunks were split, but before enough chunks had been migrated to bring the shards back below the migration threshold (which takes a while, since only one chunk migration can run at any given point in time). Bottom line: one or more of your config servers became unavailable before every chunk that needed to be migrated actually was.
Now the balancer wanted to migrate a chunk, but could not reach all config servers to acquire the balancing lock. Hence the error message.
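You can inspect the balancing lock yourself from a mongos shell. A sketch, assuming the usual layout of the config.locks collection (a state of 2 means the lock is taken, 0 means it is free):

```js
// Show who currently holds the balancer lock, if anyone.
db.getSiblingDB("config").locks.find({ _id: "balancer" }).pretty()
```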
How to deal with such a situation
It is very likely that your config servers were out of sync. In order to deal with this situation, please read Adam Comerford's answer to "mongodb config servers not in sync" and follow it to the letter.
How to prevent this
Plain and simple: use MMS. It is free, gives a lot of information about cluster health and performance, and with the automation agent, administration of a MongoDB cluster gets a lot faster.
I tend to suggest installing at least three monitoring agents, so you can have a scheduled downtime on one while maintaining redundancy with the others.
MMS has alerting capabilities, so you will be notified if one of your config servers becomes unavailable, which is a serious situation.
Best Answer
If you are creating a new sharded collection and want to minimise the time to re-balance a large amount of data, the recommended approach would be to pre-split chunks in an empty collection before inserting the data. In your example, you would have pre-split based on the data distribution so each shard would have an equal number of chunks representing about 1/4 of the data.
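As a sketch of what that pre-splitting could look like from a mongos shell: the namespace, shard key, and split points below are hypothetical; you would pick boundaries matching your own data distribution (four shards, so three split points at roughly the quartiles):

```js
// Enable sharding and pre-split an empty collection before the bulk load.
// Namespace, shard key, and split points are placeholders.
sh.enableSharding("mydb")
sh.shardCollection("mydb.mycoll", { userId: 1 })
sh.splitAt("mydb.mycoll", { userId: 25000 })   // ~1st quartile
sh.splitAt("mydb.mycoll", { userId: 50000 })   // median
sh.splitAt("mydb.mycoll", { userId: 75000 })   // ~3rd quartile
```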
Assuming you have chosen a suitable shard key for data distribution going forward, there should not be a significant number of regular chunk migrations unless you are adding or removing shards.
If your shards are backed by replica sets, you can make the balancer more aggressive by disabling `_secondaryThrottle`, which reduces the write concern used for documents migrating between shards. By default, `_secondaryThrottle` is `true`, which is equivalent to a `{w:2}` write concern: each document move during chunk migration propagates to at least one secondary before the balancer proceeds with the next document. In MongoDB 3.0+, there is also an option to configure an explicit `writeConcern` for the `secondaryThrottle` operation.

For example, you can disable `_secondaryThrottle` from a `mongos` shell, as sketched below. Balancer setting updates will not take immediate effect if there is currently a balance round / migration in progress, so you may want to disable & re-enable the balancer.
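A sketch of that update, assuming the balancer settings live in the usual config.settings document:

```js
// From a mongos shell: disable _secondaryThrottle for chunk migrations
// (upsert in case the balancer settings document does not exist yet).
db.getSiblingDB("config").settings.update(
    { _id: "balancer" },
    { $set: { _secondaryThrottle: false } },
    { upsert: true }
)
```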
For normal usage the default `_secondaryThrottle` behaviour is recommended to ensure documents have propagated and to reduce the impact of balancing. If your replica sets have more than 3 members, you could also consider increasing the `writeConcern` to `majority` (see: Write Concerns for Replica Sets).
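For example, on MongoDB 3.0+ you could set `_secondaryThrottle` to an explicit write concern document instead of a boolean; a sketch:

```js
// MongoDB 3.0+: wait for a majority of replica set members to
// acknowledge each document move during chunk migration.
db.getSiblingDB("config").settings.update(
    { _id: "balancer" },
    { $set: { _secondaryThrottle: { w: "majority" } } },
    { upsert: true }
)
```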