MongoDB config servers not in sync

mongodb

I have a setup with 2 shards (each a replica set with 2 servers), 3 config servers, and 2 mongos. I have the following problems:

1) The config servers are out of sync:

Aug 14 09:46:48 server mongos.27017[10143]: Sun Aug 11 09:46:48.987 [CheckConfigServers] ERROR: config servers not in sync! config servers mongocfg1.testing.com:27000 and mongocfg3.testing.com:27000 differ
chunks: "d2c08c5f1ee6048e5f6fab30e37a70f0"        chunks: "7e643e9402ba90567ddc9388c2abdb8a"
databases: "6f35ec52b536eee608d5bc706a72ec1e"     databases: "6f35ec52b536eee608d5bc706a72ec1e"

2) I used this document to resync the servers: http://docs.mongodb.org/manual/tutorial/replace-config-server/

3) After the resync I restarted one mongos server and see this in the logs:

Thu Aug 15 09:56:05.376 [mongosMain] MongoS version 2.4.4 starting: pid=1575 port=27111 64-bit host=web-inno.innologica.com (--help for usage)
Thu Aug 15 09:56:05.376 [mongosMain] git version: 4ec1fb96702c9d4c57b1e06dd34eb73a16e407d2
Thu Aug 15 09:56:05.376 [mongosMain] build info: Linux ip-10-2-29-40 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_49
Thu Aug 15 09:56:05.376 [mongosMain] options: { configdb: "mongocfg1.testing.com:27000,mongocfg2.testing.com:27000,mongocfg3.testing.com:27000", keyFile: "/mongo_database/pass.key", port: 27111 }
Thu Aug 15 09:56:05.582 [mongosMain] SyncClusterConnection connecting to [mongocfg1.testing.com:27000]
Thu Aug 15 09:56:05.583 [mongosMain] SyncClusterConnection connecting to [mongocfg2.testing.com:27000]
Thu Aug 15 09:56:05.583 [mongosMain] SyncClusterConnection connecting to [mongocfg3.testing.com:27000]
Thu Aug 15 09:56:05.585 [mongosMain] SyncClusterConnection connecting to [mongocfg1.testing.com:27000]
Thu Aug 15 09:56:05.586 [mongosMain] SyncClusterConnection connecting to [mongocfg2.testing.com:27000]
Thu Aug 15 09:56:05.586 [mongosMain] SyncClusterConnection connecting to [mongocfg3.testing.com:27000]
Thu Aug 15 09:56:07.213 [Balancer] about to contact config servers and shards
Thu Aug 15 09:56:07.213 [websvr] admin web console waiting for connections on port 28111
Thu Aug 15 09:56:07.213 [Balancer] starting new replica set monitor for replica set replica01 with seed of mongo1.testing.com:27020,mongo2.testing.com:27020,mongo3.testing.com:27017
Thu Aug 15 09:56:07.214 [Balancer] successfully connected to seed mongo1.testing.com:27020 for replica set replica01
Thu Aug 15 09:56:07.214 [Balancer] changing hosts to { 0: "mongo1.testing.com:27020", 1: "mongo2.testing.com:27020" } from replica01/
Thu Aug 15 09:56:07.214 [Balancer] trying to add new host mongo1.testing.com:27020 to replica set replica01
Thu Aug 15 09:56:07.215 [Balancer] successfully connected to new host mongo1.testing.com:27020 in replica set replica01
Thu Aug 15 09:56:07.215 [Balancer] trying to add new host mongo2.testing.com:27020 to replica set replica01
Thu Aug 15 09:56:07.215 [Balancer] successfully connected to new host mongo2.testing.com:27020 in replica set replica01
Thu Aug 15 09:56:07.215 [mongosMain] waiting for connections on port 27111
Thu Aug 15 09:56:07.427 [Balancer] Primary for replica set replica01 changed to mongo1.testing.com:27020
Thu Aug 15 09:56:07.429 [Balancer] replica set monitor for replica set replica01 started, address is replica01/mongo1.testing.com:27020,mongo2.testing.com:27020
Thu Aug 15 09:56:07.429 [ReplicaSetMonitorWatcher] starting
Thu Aug 15 09:56:07.430 [Balancer] starting new replica set monitor for replica set replica02 with seed of mongo5.testing.com:27020,mongo6.testing.com:27020
Thu Aug 15 09:56:07.431 [Balancer] successfully connected to seed mongo5.testing.com:27020 for replica set replica02
Thu Aug 15 09:56:07.432 [Balancer] changing hosts to { 0: "mongo5.testing.com:27020", 1: "mongo6.testing.com:27020" } from replica02/
Thu Aug 15 09:56:07.432 [Balancer] trying to add new host mongo5.testing.com:27020 to replica set replica02
Thu Aug 15 09:56:07.432 [Balancer] successfully connected to new host mongo5.testing.com:27020 in replica set replica02
Thu Aug 15 09:56:07.432 [Balancer] trying to add new host mongo6.testing.com:27020 to replica set replica02
Thu Aug 15 09:56:07.433 [Balancer] successfully connected to new host mongo6.testing.com:27020 in replica set replica02
Thu Aug 15 09:56:07.712 [Balancer] Primary for replica set replica02 changed to mongo5.testing.com:27020
Thu Aug 15 09:56:07.714 [Balancer] replica set monitor for replica set replica02 started, address is replica02/mongo5.testing.com:27020,mongo6.testing.com:27020
Thu Aug 15 09:56:07.715 [Balancer] config servers and shards contacted successfully
Thu Aug 15 09:56:07.715 [Balancer] balancer id: web-inno.innologica.com:27111 started at Aug 15 09:56:07
Thu Aug 15 09:56:07.715 [Balancer] SyncClusterConnection connecting to [mongocfg1.testing.com:27000]
Thu Aug 15 09:56:07.716 [Balancer] SyncClusterConnection connecting to [mongocfg2.testing.com:27000]
Thu Aug 15 09:56:24.438 [mongosMain] connection accepted from 127.0.0.1:55303 #1 (1 connection now open)
Thu Aug 15 09:56:24.443 [conn1]  authenticate db: admin { authenticate: 1, nonce: "6cc9a76b79656179", user: "admin", key: "xxxxxxxxxxxxxxxxxxx" }
Thu Aug 15 09:56:26.676 [conn1] creating WriteBackListener for: mongo1.testing.com:27020 serverID: 520c7b87e4a4c3afa569b21a
Thu Aug 15 09:56:26.676 [conn1] creating WriteBackListener for: mongo2.testing.com:27020 serverID: 520c7b87e4a4c3afa569b21a
Thu Aug 15 09:56:26.678 [conn1] creating WriteBackListener for: mongo5.testing.com:27020 serverID: 520c7b87e4a4c3afa569b21a
Thu Aug 15 09:56:26.678 [conn1] creating WriteBackListener for: mongo6.testing.com:27020 serverID: 520c7b87e4a4c3afa569b21a
Thu Aug 15 09:56:26.679 [conn1] SyncClusterConnection connecting to [mongocfg1.testing.com:27000]
Thu Aug 15 09:56:26.679 [conn1] SyncClusterConnection connecting to [mongocfg2.testing.com:27000]
Thu Aug 15 09:56:26.680 [conn1] SyncClusterConnection connecting to [mongocfg3.testing.com:27000]
Thu Aug 15 09:57:33.704 [conn1] warning: inconsistent chunks found when reloading collection.documents, previous version was 8651|7||51b5c7a96b2903a0b3fac106, this should be rare
Thu Aug 15 09:57:33.714 [conn1] warning: ChunkManager loaded an invalid config for collection.documents, trying again
Thu Aug 15 09:57:34.065 [conn1] warning: inconsistent chunks found when reloading collection.documents, previous version was 8651|7||51b5c7a96b2903a0b3fac106, this should be rare
Thu Aug 15 09:57:34.076 [conn1] warning: ChunkManager loaded an invalid config for collection.documents, trying again
Thu Aug 15 09:57:34.491 [conn1] warning: inconsistent chunks found when reloading collection.documents, previous version was 8651|7||51b5c7a96b2903a0b3fac106, this should be rare
Thu Aug 15 09:57:34.503 [conn1] warning: ChunkManager loaded an invalid config for collection.documents, trying again
Thu Aug 15 09:57:34.533 [conn1] Assertion: 13282:Couldn't load a valid config for collection.documents after 3 attempts. Please try again.
0xa82161 0xa46e8b 0xa473cc 0x8b857e 0x93cb52 0x93f329 0x93ff18 0x94311f 0x9740e0 0x991865 0x669887 0xa6e8ce 0x7f4456361851 0x7f445570790d
 /usr/bin/mongos(_ZN5mongo15printStackTraceERSo+0x21) [0xa82161]
 /usr/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x9b) [0xa46e8b]
 /usr/bin/mongos() [0xa473cc]
 /usr/bin/mongos(_ZN5mongo12ChunkManager18loadExistingRangesERKSs+0x24e) [0x8b857e]
 /usr/bin/mongos(_ZN5mongo8DBConfig14CollectionInfo5shardEPNS_12ChunkManagerE+0x52) [0x93cb52]
 /usr/bin/mongos(_ZN5mongo8DBConfig14CollectionInfoC1ERKNS_7BSONObjE+0x149) [0x93f329]
 /usr/bin/mongos(_ZN5mongo8DBConfig5_loadEv+0xa48) [0x93ff18]
 /usr/bin/mongos(_ZN5mongo8DBConfig4loadEv+0x1f) [0x94311f]
 /usr/bin/mongos(_ZN5mongo4Grid11getDBConfigESsbRKSs+0x480) [0x9740e0]
 /usr/bin/mongos(_ZN5mongo7Request5resetEv+0x1d5) [0x991865]
 /usr/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x67) [0x669887]
 /usr/bin/mongos(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e) [0xa6e8ce]
 /lib64/libpthread.so.0(+0x7851) [0x7f4456361851]
 /lib64/libc.so.6(clone+0x6d) [0x7f445570790d]
Thu Aug 15 09:57:34.549 [conn1] scoped connection to mongocfg1.testing.com:27000,mongocfg2.testing.com:27000,mongocfg3.testing.com:27000 not being returned to the pool
Thu Aug 15 09:57:34.549 [conn1] warning: error loading initial database config information :: caused by :: Couldn't load a valid config for collection.documents after 3 attempts. Please try again.
Thu Aug 15 09:57:34.549 [conn1] AssertionException while processing op type : 2004 to : collection.system.namespaces :: caused by :: 13282 error loading initial database config information :: caused by :: Couldn't load a valid config for collection.documents after 3 attempts. Please try again.
Thu Aug 15 09:57:37.722 [Balancer] SyncClusterConnection connecting to [mongocfg1.testing.com:27000]
Thu Aug 15 09:57:37.723 [Balancer] SyncClusterConnection connecting to [mongocfg2.testing.com:27000]
Thu Aug 15 09:57:37.723 [Balancer] SyncClusterConnection connecting to [mongocfg3.testing.com:27000]

The first mongos also logs this error:

"warning: error loading initial database config information :: caused by :: Couldn't load a valid config for collection.documents after 3 attempts. Please try again."

but it still works for now.

The second mongos doesn't work after the restart:

mongos> show collections
Thu Aug 15 09:57:34.550 JavaScript execution failed: error: {
    "$err" : "error loading initial database config information :: caused by :: Couldn't load a valid config for collection.documents after 3 attempts. Please try again.",
    "code" : 13282
} at src/mongo/shell/query.js:L128
mongos>

What are the next steps to recover the config servers?

All advice is welcome.

Best Answer

Restoring config servers, particularly if you have had some sort of catastrophic event, is tricky but not impossible. But before we go any further, a big bold caveat:

BACK UP EVERYTHING

That means taking a backup of all three config servers. I am going to give you some advice, and it is generally correct, but please, please take a backup of every current config server instance before you overwrite or replace anything.
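
As a minimal sketch of that backup, using the hostnames and port from your question (the output directory is an assumption, and a keyFile-secured cluster like yours will likely also need -u/-p with an admin user):

# Sketch: dump the config database from every config server before touching anything.
# /backup is an assumed path; add -u/-p if authentication is required.
for host in mongocfg1.testing.com mongocfg2.testing.com mongocfg3.testing.com; do
    mongodump --host "$host" --port 27000 --db config --out "/backup/config-$host"
done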

As a quick explanation, config servers are not configured as a replica set - each config server instance is supposed to be identical to the others (at least for all the collections that matter). Hence, any healthy config server can be used to replace a non-healthy config server, and you can then follow the tutorial you mentioned to get back to a good config.

The key to recovery is to identify the healthy config server and then use that to replace the others - you then end up with 3 identical config servers.

There is more than one way to do this; the approaches basically fall into three categories:

1) Use the error message

The error message that is printed out actually lets you know which config server it believes is healthy, though that is not obvious from the messaging. Here's how to read it generically:

ERROR: config servers not in sync! config servers <healthy-server> and <out-of-sync-server> differ

Basically the first one in the list is the healthy one, in your case that would be mongocfg1.testing.com:27000. That is our first candidate for a healthy config database.

2) Use dbhash to compare all three and pick the ones that agree

On each config server, switch to the config database using use config, run db.runCommand("dbhash"), and compare the hashes for the collections below (a scripted version of the check follows the list):

  • chunks
  • databases
  • settings
  • shards
  • version

You are looking for two servers that agree; that agreement is the basis for deciding that the version of the config database on those hosts is basically trustworthy and should be used to seed the rest.
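
Rather than eyeballing three shell sessions, something like this rough sketch prints the per-collection hashes from all three config servers side by side (hostnames and port are from your question; add credentials if your config servers require authentication):

# Sketch: print the dbhash of each config collection so the odd one out stands out.
for host in mongocfg1.testing.com mongocfg2.testing.com mongocfg3.testing.com; do
    echo "== $host =="
    mongo --quiet --host "$host" --port 27000 config \
          --eval 'printjson(db.runCommand("dbhash").collections)'
done

Two servers whose chunks, databases, settings, shards and version hashes all match are your trustworthy pair.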

3) Manually inspect the collections in the config database

Finally, take a look at the config database itself, paying attention to the collections listed under the second option above. This is a straight judgement call based on your familiarity with your data.
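
As a starting point, here is a sketch of the sort of spot checks that are useful, run against one config server at a time (note that the version collection has to be fetched via getCollection, because db.version is a built-in shell method):

# Sketch: spot-check the key config collections on one server; repeat per host.
mongo --quiet --host mongocfg1.testing.com --port 27000 config --eval '
    printjson(db.shards.find().toArray());             // expect replica01 and replica02
    printjson(db.databases.find().toArray());          // one document per database
    print("chunks: " + db.chunks.count());             // counts should match across servers
    printjson(db.settings.find().toArray());           // balancer settings
    printjson(db.getCollection("version").findOne());  // config metadata version
'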

Hopefully all three methods point you at the same host (or hosts). That config server should be used to seed the other two (after you have taken backups, so you can go back). That is basically your best bet. Should it fail, you may want to try one of the other versions (from the backups) - always making sure that when you start them, all three are identical.
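
The seeding itself follows the tutorial you linked: stop the mongos processes and all three config servers, then copy the healthy server's data files over the other two. In outline, assuming a dbpath of /data/configdb (an assumption - substitute your own) and running from the healthy mongocfg1:

# Sketch: reseed mongocfg2 and mongocfg3 from the healthy mongocfg1.
# All config servers (and ideally all mongos) must be stopped first.
for host in mongocfg2.testing.com mongocfg3.testing.com; do
    ssh "$host" 'mv /data/configdb /data/configdb.bak'   # keep the old copy around
    rsync -a /data/configdb/ "$host:/data/configdb/"
done
# Restart the three config servers first, then the mongos processes.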

Finally, always ensure that all mongos processes use the same config server string, with all 3 servers listed in the same order on every process - inconsistency across mongos processes can lead to (very) odd results.
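
In your case that string is already visible in your mongos startup log, so every mongos should be launched with exactly this --configdb value (the keyFile and port shown are the ones from your log; each mongos keeps its own port, but the --configdb value must match everywhere):

# Every mongos must use the identical --configdb value, in the same order:
mongos --configdb mongocfg1.testing.com:27000,mongocfg2.testing.com:27000,mongocfg3.testing.com:27000 \
       --keyFile /mongo_database/pass.key --port 27111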