I have a very simple MongoDB database structure with 5 shards (3 of them are replica sets). We are load testing the DB, and it seems that shard balancing isn't currently enabled.
On mongos I have checked the following:
mongos> sh.getBalancerState()
true
mongos> sh.isBalancerRunning()
false
I can't set BalancerRunning to true. I tried:
sh.startBalancer()
Please help me to start it for all my shards. Thank you.
Best Answer
Basically you have a few misunderstandings here, the first being that the balancer is a load balancer. It is not - it simply looks to address imbalances in chunk counts on your shards. That can have the side effect of balancing your traffic out as it moves chunks around, but strictly speaking it is not a load balancer. It also does not run continuously, rather it runs when there is work to be done and imbalances to address, otherwise it is dormant.
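To make "imbalances in chunk counts" concrete, here is a minimal sketch in plain JavaScript that does not talk to a cluster. The thresholds are the migration thresholds given in MongoDB's documentation for older releases and may differ in your version. The function names, the shard names, and the chunk counts are made up for illustration.

```javascript
// Migration threshold as documented for older MongoDB releases
// (an assumption here; check the docs for your version).
function migrationThreshold(totalChunks) {
  if (totalChunks < 20) return 2;
  if (totalChunks < 80) return 4;
  return 8;
}

// chunkCounts maps a (hypothetical) shard name to its chunk count.
// The balancer has work to do once the gap between the fullest and
// emptiest shard reaches the threshold.
function needsBalancing(chunkCounts) {
  const counts = Object.values(chunkCounts);
  const total = counts.reduce((sum, n) => sum + n, 0);
  const spread = Math.max(...counts) - Math.min(...counts);
  return spread >= migrationThreshold(total);
}

// Everything on one shard, e.g. right after a pre-split with the balancer off.
console.log(needsBalancing({ shard0000: 2049, shard0001: 0, shard0002: 0 })); // true
// Evenly spread: nothing to do, so the balancer stays dormant.
console.log(needsBalancing({ shard0000: 683, shard0001: 683, shard0002: 683 })); // false
```

Note that nothing in this check looks at traffic. Only chunk counts matter, which is why a cluster can be "balanced" and still serve most requests from one shard.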
To explain the output you are getting from the commands, let's take them one at a time. First off, let's look at what sh.getBalancerState() does (run any function without parentheses in the mongo shell and you get to see the code behind it):
So, what that command is doing is checking the settings collection in the config DB to determine if the balancer is enabled or not. If we stop the balancer, we see the setting change:
If we flip it back to enabled, we see true returned once more:
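As a rough sketch of that check and the toggle, here is a plain JavaScript model that runs without a cluster. It is not the shell's actual source: the Map stands in for the settings collection, and the `stopped` field and the "enabled when no document exists" default are assumptions based on how older shells behaved. In a real shell you would use sh.stopBalancer() and sh.setBalancerState(true) instead of the stand-in setter.

```javascript
// Stand-in for config.settings, keyed by _id (not a real connection).
const settings = new Map();

// Enabled unless a balancer document exists and marks it as stopped.
function getBalancerState(settings) {
  const doc = settings.get("balancer");
  if (doc === undefined) return true; // assumption: no document means enabled
  return !doc.stopped;
}

// Hypothetical setter that only flips the stored flag.
function setBalancerState(settings, enabled) {
  settings.set("balancer", { _id: "balancer", stopped: !enabled });
}

console.log(getBalancerState(settings)); // true
setBalancerState(settings, false);       // "stop" the balancer
console.log(getBalancerState(settings)); // false
setBalancerState(settings, true);        // flip it back
console.log(getBalancerState(settings)); // true
```

The point is that this value only reflects a stored setting. It says nothing about whether a balancing round is in progress.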
So, sh.getBalancerState() is basically for checking the setting and telling you whether the balancer is enabled or not. What it does not speak to is whether the balancer is currently actively running (i.e. checking for imbalances, migrating to address any imbalances it finds). That's where sh.isBalancerRunning() comes in.
However, if the balancer is not currently doing any work, it will not be "running" and so it will return false:
Hence, let's give it some work to do. I will re-use my example from this answer and create an imbalance while the balancer is off. Here is sh.status() and the output of sh.getBalancerState() once I have completed the pre-split:
Once I re-enable the balancer, it is going to have plenty of work to do to redistribute those 2049 (empty) chunks evenly across 3 shards, so I will have plenty of opportunity to run sh.isBalancerRunning() and get a positive. Interestingly, it took me several tries to get this to return true (just showing two for brevity):
Why is that? Well, let's look at the function again:
It is a query on the config database again, this time on the locks collection. It looks for a lock belonging to the balancer and then returns true if the state is greater than 0. Here are two examples of the document, one that returns false and one that returns true:
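The same logic can be sketched in plain JavaScript. This is a model of the check described above, not the shell's source, and the two lock documents are made-up stand-ins rather than output from a real cluster. Only the `state > 0` rule comes from the description; the specific state values and the handling of a missing document are assumptions.

```javascript
// lockDoc stands in for the balancer's document in config.locks,
// or null when no such document is found.
function isBalancerRunning(lockDoc) {
  if (lockDoc === null || lockDoc === undefined) return false; // assumption
  return lockDoc.state > 0;
}

// Hypothetical documents: one with the lock free, one with it held.
const idleLock = { _id: "balancer", state: 0 };
const heldLock = { _id: "balancer", state: 2 };

console.log(isBalancerRunning(idleLock)); // false
console.log(isBalancerRunning(heldLock)); // true
console.log(isBalancerRunning(null));     // false
```

Because the result depends on catching the lock in a non-zero state at the instant of the query, a balancer that is enabled and healthy will still report false most of the time.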
If you look closely, you will notice that the ts fields are essentially consecutive, and what you will see with empty chunks is that the non-zero states are very transient. If I fill up the chunks with data it is far easier to generate a positive.
There you have it - a full explanation of the commands you were running and why you got the results you saw. I suspect that the root of the question is actually related to a traffic imbalance, but it is not the balancer that generally causes that type of problem (as mentioned before, it is not a load balancer) - traffic imbalance is more likely caused by: