MongoDB – verify that mongos server is connected to config servers

mongodb sharding

I've been writing a backup script for sharded replica sets and it's almost done, except that I can't get it to start the balancer back up once everything else has finished.

Here's the command I'm trying to use to start the balancer back up; keep in mind that this is being run on the actual mongos server via SSH.

sudo -s
mongo -u username -p password --authenticationDatabase db
use config
sh.setBalancerState(true)
exit
exit
exit

I keep getting the following error whenever the script hits the startBalancer function, which runs the above code.

SyncClusterConnection::udpate prepare failed:  mongo-conf-0.foo.bar.com:27019:10276 
DBClientBase::findN: transport error: mongo-conf-0.foo.bar.com:27019 
ns: admin.$cmd query: { resetError: 1 }

I've tried checking the exit status of the mongo shell process, using something like

if (code != 0) {
  return next('repeat');
} else {
  return next();
}

but regardless of what actually occurs in the mongo shell, the exit code always seems to be 0.
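If the exit code can't be relied on, one option is to inspect the text the shell printed instead. The following is only a sketch: shellOutputFailed and FAILURE_PATTERNS are hypothetical names, and the pattern list is an assumption built from the error messages quoted in this post, so it is not an exhaustive list of what the shell can print.

```javascript
// Hypothetical helper: decide from captured mongo shell output whether the
// balancer command failed. The patterns are taken from the error text quoted
// in this post (an assumption, not a complete list of shell errors).
var FAILURE_PATTERNS = [
    /prepare failed/i,
    /transport error/i,
    /socket exception/i,
    /"\$err"/
];

function shellOutputFailed(output) {
    // True as soon as any known failure marker appears in the output.
    return FAILURE_PATTERNS.some(function (pattern) {
        return pattern.test(output);
    });
}

// Possible wiring, mirroring the exit-code check above (stdout and stderr
// are assumed to hold the captured output of the shell process):
//   return shellOutputFailed(stdout + stderr) ? next('repeat') : next();
```

Matching on text is brittle compared with a real exit status, so this is best treated as a stopgap.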

Any ideas on how I can verify that the mongos process is actually connected to all three config servers before I try to re-enable the balancer? I assume the problem is that the mongos server tries to connect to the config server before its mongod process has had a chance to finish starting up (part of the backup process for sharded replica sets is shutting down one of the config servers).

Best Answer

Have you tried using the sh.startBalancer() helper instead?

Rather than doing a straight update, it takes a timeout argument for how long to wait for balancing to start, as well as an interval for how long to sleep between checks. Here's the code from the shell by way of explanation:

mongos> sh.startBalancer
function ( timeout, interval ) {
    sh.setBalancerState( true )
    sh.waitForBalancer( true, timeout, interval )
}

So, you could even break it up and call the waitForBalancer helper yourself if you wished. For reference, here is the equivalent stopBalancer command erroring out when I tried to stop the balancer with a config server down:

mongos> sh.stopBalancer(2000, 100)
Waiting for active hosts...
Waiting for active host adamc-mbp.local:30999 to recognize new settings... (ping : Tue Dec 31 2013 19:51:32 GMT+0000 (GMT))
Waiting for the balancer lock...
Waiting again for active hosts after balancer is off...
Tue Dec 31 19:51:39.243 error: {
    "$err" : "error creating initial database config information :: caused by :: SyncClusterConnection::udpate prepare failed:  localhost:29000:9001 socket exception [FAILED_STATE] server [localhost:29000] ",
    "code" : 8005
} at src/mongo/shell/query.js:128
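
If the config server just needs more time to come back up, one way to build on these helpers is to retry until the call stops throwing. This is a minimal sketch, not code from the shell: retryUntilOk and its parameters are hypothetical names, the retry counts are arbitrary, and it assumes sh.startBalancer throws on failure the same way stopBalancer did in the output above.

```javascript
// Hypothetical helper: call `attempt` until it stops throwing or the tries
// run out. `pause` is injected so the same code can use the mongo shell's
// built-in sleep(ms), or a no-op when exercised outside the shell.
function retryUntilOk(attempt, maxTries, delayMs, pause) {
    var lastError = null;
    for (var i = 1; i <= maxTries; i++) {
        try {
            return { ok: true, tries: i, value: attempt() };
        } catch (e) {
            lastError = e;
            // Only wait if another attempt is coming.
            if (i < maxTries) {
                pause(delayMs);
            }
        }
    }
    return { ok: false, tries: maxTries, error: String(lastError) };
}

// Possible use inside a script run by the mongo shell (assumed values:
// 10 tries, 3 seconds apart, same timeout/interval as the example above):
//   var result = retryUntilOk(function () {
//       sh.startBalancer(2000, 100);
//   }, 10, 3000, sleep);
//   if (!result.ok) { print(result.error); quit(1); }
```

Calling quit(1) on failure also gives the calling script a non-zero exit status to check.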