MongoDB Sharded Cluster: Balancing chunks not working, stuck at “step 2”

mongodb sharding

I have one collection with sharding enabled on the _id key, spread across two shards, rs01 and rs02.

Today I added a new shard, rs03, to the cluster.
I checked the balancer status with sh.status():

shards:
    {  "_id" : "rs01",  "host" : "rs01/mongo30-01:27017,mongo30-02:27017" }
    {  "_id" : "rs02",  "host" : "rs02/mongo30-03:27017,mongo30-04:27017" }
    {  "_id" : "rs03",  "host" : "rs03/mongo30-05:27017,mongo30-06:27017" }
  balancer:
    Currently enabled:  yes
    Currently running:  yes

Some chunks had already migrated to rs03:

{  "_id" : "myDB",  "partitioned" : true,  "primary" : "rs01" }
    myDB.myCo
        shard key: { "_id" : 1 }
        chunks:
            rs01    20027
            rs02    19532
            rs03    8

I then found that I had given rs03 too little memory.
I stopped the mongod service on rs03, added memory, and started it again after the reboot.

After many hours, the number of migrated chunks was still 8.
I checked the operations on the rs03 primary with db.currentOp():

{
  "inprog": [
    {
      "desc": "migrateThread",
      "threadId": "0x22677e00",
      "opid": 206642,
      "active": true,
      "secs_running": 19754,
      "microsecs_running": NumberLong("19754679731"),
      "op": "none",
      "ns": "myDB.myCo",
      "query": {

      },
      "msg": "step 2 of 6",
      "numYields": 0,
      "locks": {

      },
      "waitingForLock": false,
      "lockStats": {
        "Global": {
          "acquireCount": {
            "r": NumberLong("662"),
            "w": NumberLong("660")
          }
        },
        "Database": {
          "acquireCount": {
            "r": NumberLong("1"),
            "w": NumberLong("659"),
            "W": NumberLong("1")
          }
        },
        "Collection": {
          "acquireCount": {
            "r": NumberLong("1"),
            "w": NumberLong("330"),
            "W": NumberLong("1")
          }
        },
        "oplog": {
          "acquireCount": {
            "w": NumberLong("328")
          }
        }
      }
    },

The operation has stayed at "step 2 of 6" for many hours.
Is there anything I can do to get the migration moving again?
I tried killing the opid, but the migrateThread started again and is still at "step 2 of 6",
and the chunk count for rs03 in sh.status() is still 8.

Best Answer

The definition of the steps referenced in a chunk migration is a little fluid, and can depend on the version you are running. However, based on what you have provided, I suspect that the primary on rs01 and the number of chunks you are trying to move are the source of the issue here.

Last time I looked at this in detail (around version 2.6), step 2 was basically a "sanity check" on the source shard primary to make sure it was ready to kick off another migration. There are two common reasons (and some not so common) for the primary to get "stuck" in step 2:

  • The primary is "too busy" - i.e. it is processing a lot of deletes from previous migrations and does not want to take on more migrations until some of those finish
  • There is an error being hit (a chunk too big to move, for example) and the migration is being aborted (rough checks for both are sketched just below this list)
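
If you want a quick way to check for both conditions, something along these lines can help (a rough sketch only; myDB.myCo is taken from the sh.status() output above, and the jumbo flag in config.chunks is how the balancer marks chunks it could not move):

    // On a mongos, against the config database: count chunks flagged as too big to move
    db.getSiblingDB("config").chunks.find({ ns: "myDB.myCo", jumbo: true }).count()

    // On the rs01 primary: look for long-running background operations related to deletes/cleanup
    db.currentOp(true).inprog.filter(function (op) {
        return op.secs_running > 60 && /delete|clean/i.test((op.msg || "") + " " + (op.desc || ""));
    })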

Your two existing shards were not yet balanced when you added the third shard. rs03 will now be the preferred destination for migrations, since it has the lowest chunk count, but adding a third shard does not make the balancing happen any more quickly. In fact, it just means there are more chunks to move than before you added it.

To confirm what the root cause is here, you will need to take a look at the logs on the primary of shard rs01. The shard with the most chunks will always be chosen as the source of a migration, and the primary will be the one doing all the work. The logs should tell you why the migrations are being aborted or taking a long time.
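
If pulling the log files off the server is awkward, you can also fetch the in-memory log buffer from the shell on the rs01 primary and filter it for migration-related lines; a quick sketch using the standard getLog command:

    // On the rs01 primary: grab recent log lines and keep anything about chunk moves
    var lines = db.adminCommand({ getLog: "global" }).log;
    lines.filter(function (l) {
        return /moveChunk|migrat|too big/i.test(l);
    }).forEach(function (l) { print(l); });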

If it is a case of being too busy, you do have an option that will free it up, temporarily at least (it may get stuck again later). You can step down the primary, thereby killing any background delete jobs it is running for cleanup.
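
The step-down itself is the normal replica set command, run while connected to the rs01 primary (60 seconds is the default step-down period; adjust as needed):

    // On the rs01 primary: step down so another member takes over,
    // which aborts the background cleanup jobs on this node
    rs.stepDown(60)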

However, this will mean leaving behind "orphaned" documents, and you will need to run a cleanup command to fix that later - you are not solving anything per se, you are just putting the work off until later. There are also only limited options to tweak this behavior.
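
For that later cleanup, the usual pattern is to run cleanupOrphaned against the primary of the affected shard in a loop until it has walked all the orphaned ranges; a sketch using the namespace from your output:

    // On the primary of the shard to clean up (e.g. rs01): loop over all orphaned ranges
    var nextKey = {};
    var result;
    while (nextKey != null) {
        result = db.adminCommand({ cleanupOrphaned: "myDB.myCo", startingFromKey: nextKey });
        if (result.ok != 1) {
            print("cleanupOrphaned stopped: " + tojson(result));
            break;
        }
        nextKey = result.stoppedAtKey;
    }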

There are other options in terms of balancing chunks more quickly (mitosis), but they are way beyond the scope of an answer on this site I'm afraid.

If there is an error being hit, and the cleanup is not the issue, then it would probably be best to get the primary log messages and post a separate question to figure out the cause.