MongoDB – mongos goes down when I shut down one of three servers

mongodb

I deployed three MongoDB servers with the following layout:

  • node-01:
    • mongos
    • config server 1
    • replication server 1
  • node-02:
    • mongos
    • config server 2
    • replication server 2
  • node-03:
    • config server 3
    • arbiter

After deployment I found that if I shut down node-01 or node-02, mongos fails immediately.

I think it's because a replica set member going down requires a metadata change, but I couldn't find any information about this on Google.

Here is a part of log from mongos:

> 2014-05-21T18:57:03.888+0800 [Balancer] reconnect 10.162.54.97:28002 (10.162.54.97) failed failed couldn't connect to server 10.162.54.97:28002 (10.162.54.97), connection attempt failed
> 2014-05-21T18:57:08.892+0800 [Balancer] scoped connection to 10.161.236.222:28001,10.162.54.97:28002,10.132.42.79:28003 not being returned to the pool
> 2014-05-21T18:57:08.892+0800 [Balancer] caught exception while doing balance: error checking clock skew of cluster 10.161.236.222:28001,10.162.54.97:28002,10.132.42.79:28003 :: caused by :: 13647 could not get status from server 10.162.54.97:28002 in cluster 10.162.54.97:28002 to check time :: caused by :: 11002 socket exception [CONNECT_ERROR] server [10.162.54.97:28002] connection pool error: couldn't connect to server 10.162.54.97:28002 (10.162.54.97), connection attempt failed
> 2014-05-21T18:57:10.176+0800 [ReplicaSetMonitorWatcher] warning: Failed to connect to 10.162.54.97:27017, reason: errno:115 Operation now in progress
> 2014-05-21T18:57:12.693+0800 warning:  couldn't check dbhash on config server 10.162.54.97:28002 :: caused by :: 11002 socket exception [CONNECT_ERROR] server [10.162.54.97:28002] connection pool error: couldn't connect to server 10.162.54.97:28002 (10.162.54.97), connection attempt failed
> 2014-05-21T18:57:19.895+0800 [Balancer] SyncClusterConnection connecting to [10.161.236.222:28001]
> 2014-05-21T18:57:19.895+0800 [Balancer] SyncClusterConnection connecting to [10.162.54.97:28002]
> 2014-05-21T18:57:24.896+0800 [Balancer] SyncClusterConnection connect fail to: 10.162.54.97:28002 errmsg: couldn't connect to server 10.162.54.97:28002 (10.162.54.97), connection attempt failed
> 2014-05-21T18:57:24.896+0800 [Balancer] SyncClusterConnection connecting to [10.132.42.79:28003]
> 2014-05-21T18:57:24.899+0800 [Balancer] trying reconnect to 10.162.54.97:28002 (10.162.54.97) failed
> 2014-05-21T18:57:25.176+0800 [ReplicaSetMonitorWatcher] warning: Failed to connect to 10.162.54.97:27017, reason: errno:115 Operation now in progress
> 2014-05-21T18:57:29.899+0800 [Balancer] reconnect 10.162.54.97:28002 (10.162.54.97) failed failed couldn't connect to server 10.162.54.97:28002 (10.162.54.97), connection attempt failed

Best Answer

When you say "shutdown node-01 or node-02", what are you actually shutting down (the mongod service, the config server service, or the host)? Please also state which node you shut down in the example you gave, as well as which node you took the mongos log from.

You can run your config servers on the same hosts as your replica set. However, you may wish to run your config servers on separate hosts. That way, if a replica set host goes down, you don't lose a config server as well. If you do lose a config server, your cluster won't be able to migrate chunk data on sharded collections. (This is what is happening in your logs: the balancer will remain disabled until your config server comes back online.)

Note: I probably would not have apps connecting to the mongos instances running on the replica set hosts. It is fine that they are running there, since it gives admins easy access to a mongos to connect to. But I would run a mongos that sits outside these hosts. Otherwise a host going down in your cluster means your app, or at least part of it, is also down.

Replication is unaffected by a config server going down (just chunk migration). Do you have a sharded collection in this cluster?
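
To make that reasoning concrete, here is a rough sketch that models the layout from the question and works out what is left when one host goes away. It is not something MongoDB provides: the role labels and the `impact_of_losing` helper are made up for illustration. It assumes mirrored config servers (the `SyncClusterConnection` lines in the log suggest that setup), where metadata changes need every config server reachable, and it assumes the usual replica set rule that a primary needs a strict majority of voting members.

```python
# Layout taken from the question. The role labels are illustrative only.
LAYOUT = {
    "node-01": {"mongos", "config", "rs-data"},
    "node-02": {"mongos", "config", "rs-data"},
    "node-03": {"config", "rs-arbiter"},
}

# Both data-bearing members and the arbiter vote in elections.
VOTING_ROLES = {"rs-data", "rs-arbiter"}


def impact_of_losing(node, layout=LAYOUT):
    """Summarize what is left of the cluster when one host is shut down."""
    survivors = [roles for name, roles in layout.items() if name != node]

    config_total = sum("config" in roles for roles in layout.values())
    config_up = sum("config" in roles for roles in survivors)

    voters_total = sum(len(roles & VOTING_ROLES) for roles in layout.values())
    voters_up = sum(len(roles & VOTING_ROLES) for roles in survivors)
    data_up = sum("rs-data" in roles for roles in survivors)

    return {
        "mongos_up": sum("mongos" in roles for roles in survivors),
        "config_servers_up": config_up,
        # Assumption: mirrored config servers, so metadata changes
        # (chunk migrations, splits) need every config server reachable.
        "metadata_writable": config_up == config_total,
        # A primary needs a strict majority of voters and a data-bearing member.
        "can_elect_primary": voters_up > voters_total // 2 and data_up >= 1,
    }


if __name__ == "__main__":
    for node in LAYOUT:
        print(node, impact_of_losing(node))
```

Under those assumptions, losing any single host still leaves a voting majority and at least one data-bearing member, so the replica set should be able to keep (or elect) a primary. Only the metadata becomes read-only, which stops the balancer but should not by itself take a surviving mongos down.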

Follow up:

Not sure what happened with my comment. I'll try it this way:

You can absolutely run your config servers on the same hosts as your replicas. Here is a quote from Chodorow, Kristina. MongoDB: The Definitive Guide. Beijing: O'Reilly, 2013, page 243.

"In terms of provisioning, config servers do not need much space or many resources. A generous estimate is 1 KB of config server space per 200 MB of actual data: they really are just tables of contents. As they don't use many resources, you can deploy config servers on machines running other things, like app servers, shard mongods, or mongos processes."

Something else is going on. When you connect your mongos to the config servers, are you adding them using IP addresses or host names? What is the output of rs.conf() and rs.status() before and after shutting down the node? I notice mongos is attempting to connect on port 27017. Typically, in a sharded setup, your mongod port will be 27018 and config port will be 27019. Did you change the port numbers?
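
If it helps with answering those questions, a quick way to see which endpoints mongos is failing to reach is to pull the `host:port` pairs out of the connection errors. The sketch below is only an illustration: the regular expression covers just the two message shapes that appear in the log excerpt from the question, and `unreachable_endpoints` is a made-up helper name.

```python
import re
from collections import Counter

# Covers the two failure messages seen in the posted log:
#   "couldn't connect to server HOST:PORT"
#   "Failed to connect to HOST:PORT"
UNREACHABLE = re.compile(
    r"(?:couldn't connect to server|Failed to connect to) "
    r"(\d+\.\d+\.\d+\.\d+:\d+)"
)


def unreachable_endpoints(log_lines):
    """Count, per host:port, how many log lines report a failed connection."""
    counts = Counter()
    for line in log_lines:
        # One line can repeat the same endpoint; count it once per line.
        for endpoint in set(UNREACHABLE.findall(line)):
            counts[endpoint] += 1
    return counts


# Three lines copied from the log in the question.
sample = [
    "2014-05-21T18:57:03.888+0800 [Balancer] reconnect 10.162.54.97:28002 "
    "(10.162.54.97) failed failed couldn't connect to server "
    "10.162.54.97:28002 (10.162.54.97), connection attempt failed",
    "2014-05-21T18:57:10.176+0800 [ReplicaSetMonitorWatcher] warning: "
    "Failed to connect to 10.162.54.97:27017, reason: errno:115 "
    "Operation now in progress",
    "2014-05-21T18:57:19.895+0800 [Balancer] SyncClusterConnection "
    "connecting to [10.161.236.222:28001]",
]

if __name__ == "__main__":
    for endpoint, hits in sorted(unreachable_endpoints(sample).items()):
        print(endpoint, hits)
```

In the posted excerpt every failure points at 10.162.54.97, on port 28002 (listed in the config server string) and port 27017 (the one the replica set monitor is watching), which is why the questions about ports and which host was shut down matter.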

I suspect you have something misconfigured. Mongo's fail-over is pretty rock solid.