MongoDB – How to Activate Shard Balancer in Sharded Cluster

Tags: mongodb, sharding

I have a very simple MongoDB database structure with 5 shards in it (3 of them are replica sets). We are load-testing the DB and it seems that shard balancing isn't currently happening.

On mongos I have checked the following:

mongos> sh.getBalancerState()
true
mongos> sh.isBalancerRunning()
false

I can't get sh.isBalancerRunning() to return true. I tried:

sh.startBalancer()

Please help me to start it for all my shards. Thank you.

Best Answer

Basically you have a few misunderstandings here, the first being that the balancer is a load balancer. It is not - it simply looks to address imbalances in chunk counts on your shards. That can have the side effect of balancing your traffic out as it moves chunks around, but strictly speaking it is not a load balancer. It also does not run continuously, rather it runs when there is work to be done and imbalances to address, otherwise it is dormant.

To explain the output you are getting, let's take the commands one at a time. First off, let's look at what sh.getBalancerState() does (run any function without parentheses in the mongo shell and you get to see the code behind it):

mongos> sh.getBalancerState
function () {
    var x = db.getSisterDB( "config" ).settings.findOne({ _id: "balancer" } )
    if ( x == null )
        return true;
    return ! x.stopped;
}

So, what that command is doing is checking the settings collection in the config DB to determine if the balancer is enabled or not. If we stop the balancer, we see the setting change:

mongos> sh.stopBalancer()
Waiting for active hosts...
Waiting for the balancer lock...
Waiting again for active hosts after balancer is off...
mongos> sh.getBalancerState()
false

If we flip it back to enabled, we see true returned once more:

mongos> sh.startBalancer()
mongos> sh.getBalancerState()
true
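Boiled down, that enable/disable flag behaves like the following pure function - a sketch in plain JavaScript, with made-up sample documents shaped like the { _id: "balancer" } entry in config.settings:

```javascript
// Sketch of the decision inside sh.getBalancerState(): absence of a
// settings document counts as "enabled", otherwise the stopped flag decides.
function balancerState(settingsDoc) {
  if (settingsDoc == null) {
    // No { _id: "balancer" } document: the balancer was never stopped,
    // so it is enabled by default.
    return true;
  }
  return !settingsDoc.stopped;
}

console.log(balancerState(null));                                // true  (default)
console.log(balancerState({ _id: "balancer", stopped: true }));  // false (after sh.stopBalancer())
console.log(balancerState({ _id: "balancer", stopped: false })); // true  (after sh.startBalancer())
```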

So, sh.getBalancerState() is basically for checking the setting and telling you whether the balancer is enabled or not. What it does not speak to is whether the balancer is currently actively running (i.e. checking for imbalances, migrating to address any imbalances it finds). That's where sh.isBalancerRunning() comes in.

However, if the balancer is not currently doing any work, it will not be "running" and so it will return false:

mongos> sh.isBalancerRunning()
false

Hence, let's give it some work to do. I will re-use my example from this answer and create an imbalance while the balancer is off. Here is sh.status() and the output of sh.getBalancerState() once I have completed the pre-split:

mongos> sh.getBalancerState()
false
mongos> sh.status()
--- Sharding Status --- 
  sharding version: {
    "_id" : 1,
    "version" : 3,
    "minCompatibleVersion" : 3,
    "currentVersion" : 4,
    "clusterId" : ObjectId("53b5d3b5d95df3a66a597548")
}
  shards:
    {  "_id" : "shard0000",  "host" : "localhost:30000" }
    {  "_id" : "shard0001",  "host" : "localhost:30001" }
    {  "_id" : "shard0002",  "host" : "localhost:30002" }
  databases:
    {  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
    {  "_id" : "test",  "partitioned" : false,  "primary" : "shard0001" }
    {  "_id" : "users",  "partitioned" : true,  "primary" : "shard0001" }
        users.userInfo
            shard key: { "_id" : 1 }
            chunks:
                shard0001   2049
            too many chunks to print, use verbose if you want to force print
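The status above shows all 2049 chunks sitting on shard0001. As a rough sketch of why the balancer will treat that as work: older MongoDB documentation describes a migration threshold based on the difference between the most- and least-loaded shards' chunk counts (2 below 20 chunks, 4 for 20-79, 8 at 80 or more). Treat the exact numbers here as an assumption about those versions, not a spec:

```javascript
// Hypothetical sketch of the balancer's per-round imbalance check:
// migrate only if (most loaded - least loaded) meets the threshold.
function migrationThreshold(totalChunks) {
  if (totalChunks < 20) return 2;
  if (totalChunks < 80) return 4;
  return 8;
}

function needsBalancing(chunkCounts) {
  const total = chunkCounts.reduce((a, b) => a + b, 0);
  const spread = Math.max(...chunkCounts) - Math.min(...chunkCounts);
  return spread >= migrationThreshold(total);
}

console.log(needsBalancing([2049, 0, 0]));    // true  - the pre-split above
console.log(needsBalancing([683, 683, 683])); // false - evenly spread, nothing to do
```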

Once I re-enable the balancer, it is going to have plenty of work to do to redistribute those 2049 (empty) chunks evenly across 3 shards, so I will have plenty of opportunity to run sh.isBalancerRunning() and get a positive. Interestingly, it took me several tries to get this to return true (just showing two for brevity):

mongos> sh.isBalancerRunning()
false
mongos> sh.isBalancerRunning()
true

Why is that? Well, let's look at the function again:

mongos> sh.isBalancerRunning
function () {
    var x = db.getSisterDB("config").locks.findOne({ _id: "balancer" });
    if (x == null) {
        print("config.locks collection empty or missing. be sure you are connected to a mongos");
        return false;
    }
    return x.state > 0;
}

It is a query on the config database again, this time on the locks collection. It looks for a lock belonging to the balancer and then returns true if the state is greater than 0. Here are two examples of the document, one that returns false and one that returns true:

db.getSisterDB("config").locks.findOne({ _id: "balancer" });
{
    "_id" : "balancer",
    "process" : "adamc-mbp:30999:1404425140:16807",
    "state" : 2,
    "ts" : ObjectId("53b5d86fd95df3a66a5975ff"),
    "when" : ISODate("2014-07-03T22:25:51.574Z"),
    "who" : "adamc-mbp:30999:1404425140:16807:Balancer:1622650073",
    "why" : "doing balance round"
}
db.getSisterDB("config").locks.findOne({ _id: "balancer" });
{
    "_id" : "balancer",
    "process" : "adamc-mbp:30999:1404425140:16807",
    "state" : 0,
    "ts" : ObjectId("53b5d86ed95df3a66a5975fe"),
    "when" : ISODate("2014-07-03T22:25:50.528Z"),
    "who" : "adamc-mbp:30999:1404425140:16807:Balancer:1622650073",
    "why" : "doing balance round"
}
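Boiled down, the check sh.isBalancerRunning() performs on documents like the two above is just this - a plain JavaScript sketch; the state meanings (0 released, greater than 0 held) match the lock documents shown, but verify against your version's docs:

```javascript
// Sketch of the lock check inside sh.isBalancerRunning().
function balancerRunning(lockDoc) {
  if (lockDoc == null) {
    // No config.locks entry: empty collection, or not connected to a mongos.
    return false;
  }
  return lockDoc.state > 0; // 0 = lock released, > 0 = balance round in progress
}

console.log(balancerRunning(null));         // false
console.log(balancerRunning({ state: 0 })); // false - second document above
console.log(balancerRunning({ state: 2 })); // true  - first document above
```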

If you look closely, you will notice that the ts fields are essentially consecutive, and with empty chunks the non-zero states are very transient. If I fill the chunks with data, it is far easier to catch a positive.
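You can see how close together those two rounds were without a driver at all: the first 4 bytes of an ObjectId are a big-endian Unix timestamp, so the ts values decode directly from their hex strings (plain JavaScript sketch):

```javascript
// Decode the embedded creation time of an ObjectId from its hex string.
function objectIdTime(hex) {
  const seconds = parseInt(hex.substring(0, 8), 16); // first 4 bytes = epoch seconds
  return new Date(seconds * 1000);
}

console.log(objectIdTime("53b5d86ed95df3a66a5975fe").toISOString()); // 2014-07-03T22:25:50.000Z
console.log(objectIdTime("53b5d86fd95df3a66a5975ff").toISOString()); // 2014-07-03T22:25:51.000Z
```

One second apart, consistent with the "when" fields in the two lock documents.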

There you have it - a full explanation of the commands you were running and why you got the results you saw. I suspect that the root of the question is actually a traffic imbalance, but the balancer does not generally cause that type of problem (as mentioned before, it is not a load balancer) - a traffic imbalance is more likely caused by:

  • A poor shard key (perhaps very poor)
  • Balancing is not enabled for the database/collection (see previous pre-split answer for instructions there)
  • Something is preventing the balancer from working (config server down, balancer window, migration aborts)
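On the second bullet: in the versions discussed here, sh.disableBalancing() records a noBalance flag on the collection's entry in config.collections, so the check boils down to something like this (JavaScript sketch; the document shapes are illustrative):

```javascript
// Sketch: is balancing enabled for a collection, given its
// config.collections entry?
function collectionBalancingEnabled(collDoc) {
  if (collDoc == null) {
    // No entry at all: the collection is not sharded, so nothing to balance.
    return false;
  }
  return !collDoc.noBalance; // absent flag means balancing was never disabled
}

console.log(collectionBalancingEnabled({ _id: "users.userInfo" }));                  // true
console.log(collectionBalancingEnabled({ _id: "users.userInfo", noBalance: true })); // false
```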