Mongodb – mongo shard balancer not working

In our production system, we have a sharded cluster with:

2 shards (each one is replica set of primary + secondary + arbiter)

3 Config servers

4 mongoS instances

A couple of days ago we had a network outage during which a secondary failed to connect to its corresponding primary. The issue was resolved after some time, but the balancer has been stuck ever since.

All of the data is ending up on a single shard and nothing is being transferred. In the mongos logs we saw that the balancer lock was acquired by one mongos instance and has not been released since:

    configsvr> db.locks.find({state:2}).pretty()
    {
    "_id" : "balancer",
    "process" : "mongosHost3:27017:45654643:-54656",
    "state" : 2,
    "ts" : ObjectId("57c29902a443254e2d8e1b27"),
    "when" : ISODate("2016-08-28T07:55:46.350Z"),
    "who" : "mongosHost3:27017:45654643:-54656:Balancer:37654778",
    "why" : "doing balance round"
    }

We tried restarting this mongos instance, but nothing changed. We also tried stopping the balancer, but that failed with the error:

     Error: Error: assert.soon failed, msg:Waited too long for lock balancer to unlock

How can we resolve this balancer issue?

Is there any way we can manually relinquish this lock and re-enable balancer?

Best Answer

First, restart the mongos process on mongosHost3, then recheck the locks as you did in the question.

If the same lock is still present after that restart, it is safe to remove it manually. Make sure it really is the same lock and not a new one taken by the same process (compare the ts field, for example). Such a lock is stale and should be cleaned up eventually in any case. You can remove it just like any other document, assuming you have permission to edit the config database. MongoDB 3.2 introduced the deleteOne method (db.collection.deleteOne() in the shell), which I would recommend using here.
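As a rough sketch of that manual removal, assuming the lock document still looks exactly like the one in the question: the helper name removeStaleBalancerLock is made up for illustration, and the filter matches on _id, state and ts so that a lock freshly acquired by the same process would be left alone.

```javascript
// Hypothetical helper (not part of MongoDB): deletes the balancer lock only
// if it is still the exact stale lock identified by its ts value.
function removeStaleBalancerLock(configDB, staleTs) {
    // Matching on ts as well as _id and state means a lock that was
    // re-acquired after the restart will not match and is not deleted.
    var result = configDB.locks.deleteOne({
        _id: "balancer",
        state: 2,
        ts: staleTs
    });
    // 1 if the stale lock was removed, 0 if no document matched.
    return result.deletedCount;
}

// Assumed usage in the mongo shell (3.2 or later), connected to a mongos,
// with the ts value copied from the lock shown in the question:
//   removeStaleBalancerLock(db.getSiblingDB("config"),
//                           ObjectId("57c29902a443254e2d8e1b27"))
```

A return value of 0 would suggest that the lock has already changed or been released, in which case it is worth rechecking the locks collection before doing anything else.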

It is also possible that this is not a stale lock at all but a problem with balancing itself (i.e. it is seeing errors or is stuck or slow for some reason). Once balancing kicks off again, look in the mongos logs of whichever process holds the lock and see what it reports as progress for balancing.

You can also look in the changelog (the changelog collection in the config database) for more information related to balancing and migrations. If balancing is running into issues, I would recommend gathering the logs and changelog information and posting a separate question to troubleshoot that.
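One possible way to pull recent migration entries out of the changelog is sketched below. The helper name recentMigrationEvents is hypothetical, and the sketch assumes that migration entries carry a what value starting with "moveChunk" and a time field, as they normally do in config.changelog.

```javascript
// Hypothetical helper (not part of MongoDB): returns the most recent
// chunk-migration entries from the changelog, newest first.
function recentMigrationEvents(configDB, limit) {
    return configDB.changelog
        .find({ what: /^moveChunk/ })   // e.g. moveChunk.start, moveChunk.commit
        .sort({ time: -1 })             // newest entries first
        .limit(limit)
        .toArray();
}

// Assumed usage in the mongo shell, connected to a mongos:
//   recentMigrationEvents(db.getSiblingDB("config"), 10).forEach(printjson)
```

Entries whose details mention errors or aborted migrations are usually the most useful ones to include in a follow-up question.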
