MongoDB document found on replica set but not in mongos

mongodb sharding

I have a sharded cluster: two shards – each a replica set with 3 nodes, and a config replica set (with 3 nodes).

I sharded two collections in the same database:
players.user_id – a hashed shard key
users._id – an unhashed shard key

I wanted users to be sharded on a hashed _id instead, so I followed these steps: https://docs.mongodb.org/manual/faq/sharding/#can-i-change-the-shard-key-after-sharding-a-collection.

Notes:
1. Since the users collection was already sharded, the backup and restore were done per replica set (each shard was dumped from and restored to the same replica set), using data dumps.
2. After dropping the users collection I recreated it, added an _id: hashed index, and then configured sharding with _id: hashed (roughly the commands sketched after these notes).
3. After sharding I restored the data and ran my application, which created the other indexes – an _id: 1 index and several others, some of them sparse.
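For reference, notes 2 and 3 correspond roughly to this sequence on a mongos (a sketch; database and collection names are taken from the sh.status() output below, and the exact dump/restore invocations are omitted):

mongos> use mydb
mongos> db.users.drop()
mongos> db.createCollection("users")
mongos> db.users.createIndex({ _id: "hashed" })               // index backing the hashed shard key
mongos> sh.shardCollection("mydb.users", { _id: "hashed" })
// the dumped data was then restored per replica set, and the application
// re-created the remaining indexes (including the sparse ones) afterwards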

After this, a lot of users (roughly half of them) can be found by querying the mongod on any node of the replica set where I backed them up and restored them, but not by querying through mongos.

Also, my counts are wrong:

mongos> db.users.find({}).length()
355
mongos> db.users.find({}).count()
664

repl-set-1:SECONDARY> db.users.find({}).length()
36
repl-set-2:SECONDARY> db.users.find({}).length()
628
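(A cross-check worth noting: on a sharded cluster, count() without a predicate can be answered from per-shard metadata and may include documents a shard holds but does not own, while an aggregation has to count through document-level filtering on mongos. A sketch against the same collection:)

mongos> db.users.aggregate([ { $group: { _id: null, n: { $sum: 1 } } } ])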

Balancing is enabled, all of my replica set nodes are in sync within their replica sets, each server can reach the others, and I don't have timeouts or other visible configuration problems.

mongos> sh.status()
--- Sharding Status ---
  sharding version: {
    "_id" : 1,
    "minCompatibleVersion" : 5,
    "currentVersion" : 6,
    "clusterId" : ObjectId("...")
}
  shards:
    {  "_id" : "repl-set-1",  "host" : "repl-set-1/repl-set-1-1:27017,repl-set-1-2:27017,repl-set-1-3:27017" }
    {  "_id" : "repl-set-2",  "host" : "repl-set-2/repl-set-2-1:27017,repl-set-2-2:27017,repl-set-2-3:27017" }
  active mongoses:
    "3.2.1" : 3
  balancer:
    Currently enabled:  yes
    Currently running:  no
    Failed balancer rounds in last 5 attempts:  0
    Migration Results for the last 24 hours:
        No recent migrations
  databases:
    {  "_id" : "mydb",  "primary" : "repl-set-2",  "partitioned" : true }
        mydb.players
            shard key: { "user" : "hashed" }
            unique: false
            balancing: true
            chunks:
                repl-set-1  2
                repl-set-2  2
            { "user" : { "$minKey" : 1 } } -->> { "user" : NumberLong("-4611686018427387902") } on : repl-set-1 Timestamp(2, 2)
            { "user" : NumberLong("-4611686018427387902") } -->> { "user" : NumberLong(0) } on : repl-set-1 Timestamp(2, 3)
            { "user" : NumberLong(0) } -->> { "user" : NumberLong("4611686018427387902") } on : repl-set-2 Timestamp(2, 4)
            { "user" : NumberLong("4611686018427387902") } -->> { "user" : { "$maxKey" : 1 } } on : repl-set-2 Timestamp(2, 5)
        mydb.users
            shard key: { "_id" : "hashed" }
            unique: false
            balancing: true
            chunks:
                repl-set-1  2
                repl-set-2  2
            { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-4611686018427387902") } on : repl-set-1 Timestamp(2, 2)
            { "_id" : NumberLong("-4611686018427387902") } -->> { "_id" : NumberLong(0) } on : repl-set-1 Timestamp(2, 3)
            { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("4611686018427387902") } on : repl-set-2 Timestamp(2, 4)
            { "_id" : NumberLong("4611686018427387902") } -->> { "_id" : { "$maxKey" : 1 } } on : repl-set-2 Timestamp(2, 5)

My questions are:
1. How can I debug this? I tried a lot of things…
2. What may be causing this – the way I backed up and restored the data, or the presence of sparse indexes?
3. How can I fix this, if I can?

Best Answer

Have you followed the MongoDB guide for restoring a sharded cluster? The procedure is different from restoring a non-sharded cluster.
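One detail that matters for any backup or restore of a sharded cluster is the balancer: it should be disabled for the duration and re-enabled only once the config servers and all shards are consistent again. A sketch of the relevant shell calls, run on a mongos:

mongos> sh.stopBalancer()
mongos> sh.getBalancerState()      // should report false before dumping or restoring anything
// ... restore config servers and shards per the guide ...
mongos> sh.setBalancerState(true)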

The key here is that the backup and restore were done directly on the shards for a sharded collection. The reason for the current state of your database is that neither the config servers nor the cached routing information on the mongos instances are aware of the restore that happened on the shards, so the metadata stored on the config servers does not reflect the actual distribution of the data across the shards.
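Forcing each mongos to reload its cached routing metadata rules out a purely stale router cache, although it cannot repair config metadata that never reflected the restore in the first place. A sketch, run against every mongos:

mongos> db.adminCommand({ flushRouterConfig: 1 })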

You can try to get back into a proper state:

  1. By restoring the cluster following the guide above.

  2. By removing a shard and then adding it back, so that MongoDB drains all of that shard's data to the remaining shard and re-balances after the shard is re-added. You can use the removeShard and addShard operations (see the sketch below).
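A minimal sketch of option 2, assuming repl-set-1 is the shard being drained (names taken from the sh.status() output above). removeShard has to be re-issued until it reports "state" : "completed", and if the drained shard is the primary shard for any database, a movePrimary is required before the removal can finish:

mongos> use admin
mongos> db.runCommand({ removeShard: "repl-set-1" })   // starts draining; run again to check progress
mongos> db.runCommand({ removeShard: "repl-set-1" })   // repeat until "state" : "completed"
mongos> sh.addShard("repl-set-1/repl-set-1-1:27017,repl-set-1-2:27017,repl-set-1-3:27017")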