I have one collection with sharding enabled on the _id key, spread across two shards, rs01 and rs02. Today I added a new shard, rs03, to the cluster. I checked the balancer status with sh.status(), which showed:
shards:
{ "_id" : "rs01", "host" : "rs01/mongo30-01:27017,mongo30-02:27017" }
{ "_id" : "rs02", "host" : "rs02/mongo30-03:27017,mongo30-04:27017" }
{ "_id" : "rs03", "host" : "rs03/mongo30-05:27017,mongo30-06:27017" }
balancer:
Currently enabled: yes
Currently running: yes
Some chunks had already migrated to rs03:
{ "_id" : "myDB", "partitioned" : true, "primary" : "rs01" }
myDB.myCo
shard key: { "_id" : 1 }
chunks:
rs01 20027
rs02 19532
rs03 8
Then I realized I had allocated too little memory to rs03. I stopped the mongod service on rs03, added memory, and started it again after the reboot.
Many hours later, the migrated chunk count was still 8. I checked the operations on the rs03 primary with db.currentOp(), which showed:
{
    "inprog": [
        {
            "desc": "migrateThread",
            "threadId": "0x22677e00",
            "opid": 206642,
            "active": true,
            "secs_running": 19754,
            "microsecs_running": NumberLong("19754679731"),
            "op": "none",
            "ns": "myDB.myCo",
            "query": {},
            "msg": "step 2 of 6",
            "numYields": 0,
            "locks": {},
            "waitingForLock": false,
            "lockStats": {
                "Global": {
                    "acquireCount": {
                        "r": NumberLong("662"),
                        "w": NumberLong("660")
                    }
                },
                "Database": {
                    "acquireCount": {
                        "r": NumberLong("1"),
                        "w": NumberLong("659"),
                        "W": NumberLong("1")
                    }
                },
                "Collection": {
                    "acquireCount": {
                        "r": NumberLong("1"),
                        "w": NumberLong("330"),
                        "W": NumberLong("1")
                    }
                },
                "oplog": {
                    "acquireCount": {
                        "w": NumberLong("328")
                    }
                }
            }
        },
The operation has stayed at "step 2 of 6" for many hours. Is there anything I can do to get the migration moving again? I tried killing the opid, but the migrateThread was spawned again and is still at "step 2 of 6", and the chunk counts in sh.status() are unchanged at 8.
Best Answer
The definition of the steps referenced in a chunk migration is a little fluid and can depend on the version you are running. However, based on what you have provided, I suspect that the primary on rs01, together with the number of chunks you are trying to move, is the source of the issue here.

Last time I looked at this in detail (around version 2.6), step 2 was basically a "sanity check" on the source shard primary to make sure it was ready to kick off another migration. There are two common reasons (and some not so common) for the primary to get "stuck" in step 2:

The source primary is too busy, typically with background delete jobs cleaning up after previous migrations, since a new migration has to wait for those deletes to complete.
The source primary is hitting an error each time it tries to start the migration.
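To see what the source primary is doing in the meantime, you can list its long-running operations from the mongo shell. This is a sketch; the regular expression on msg/desc is an assumption about how migration and delete threads report themselves on your version, so adjust it to what you actually see:

```javascript
// Run on the PRIMARY of rs01 (the source shard) in the mongo shell.
// db.currentOp(true) also includes system operations such as background
// delete threads; print anything that looks migration- or cleanup-related.
var ops = db.currentOp(true).inprog;
ops.forEach(function (op) {
    if (op.active && /moveChunk|migrate|delet/i.test((op.msg || "") + (op.desc || ""))) {
        print(op.opid + "  " + op.desc + "  " + (op.msg || "") +
              "  running " + op.secs_running + "s");
    }
});
```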
Given that your two existing shards were not yet balanced when you added a third shard, rs03 will now be the preferred destination for migrated chunks, since it has the lowest total. But adding a third shard does not make balancing happen any more quickly; in fact, it just means there are more chunks to move than before.

To confirm the root cause, you will need to look at the logs on the primary of shard rs01. The shard with the most chunks is always chosen as the source of a migration, and its primary does all the work, so the logs there should tell you why the migrations are being aborted or are taking a long time.

If it is a case of the primary being too busy, you do have an option that will free it up, temporarily at least (it may get stuck again later): you can step down the primary, which kills any background delete jobs it is running for cleanup.
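Stepping down is done from the mongo shell on the rs01 primary; the 120-second window below is just an illustrative value:

```javascript
// Run on the PRIMARY of rs01. This closes client connections, triggers an
// election, and kills the background delete threads running on this node.
// The node will not seek re-election for the next 120 seconds.
rs.stepDown(120)
```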
However, this will leave behind orphaned documents, and you will need the cleanupOrphaned command to fix that later. You are not solving anything per se, just deferring the work. There are also only limited options to tweak this behavior; for more, see here.
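The cleanup command in question is cleanupOrphaned, run against the primary of each shard that may hold orphaned documents. This loop follows the documented pattern of walking the shard-key ranges one at a time:

```javascript
// Run on the PRIMARY of each shard that may own orphaned documents.
var nextKey = {};
var result;
while (nextKey != null) {
    result = db.adminCommand({
        cleanupOrphaned: "myDB.myCo",
        startingFromKey: nextKey
    });
    if (result.ok != 1) {
        print("cleanupOrphaned failed: " + tojson(result));
        break;
    }
    printjson(result);
    nextKey = result.stoppedAtKey;  // null once all ranges are processed
}
```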
There are other options for balancing chunks more quickly (mitosis), but they are well beyond the scope of an answer on this site, I'm afraid.
If an error is being hit, and cleanup is not the issue, then it would probably be best to grab the primary's log messages and post a separate question to figure out the cause.
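If you cannot get at the log files on disk directly, you can pull recent log lines through the shell. Note that getLog only returns the most recent in-memory lines (1024 by default), so it may miss older events:

```javascript
// Run on the PRIMARY of rs01. Print recent log lines mentioning migrations.
var log = db.adminCommand({ getLog: "global" }).log;
log.forEach(function (line) {
    if (/moveChunk|migrate/i.test(line)) {
        print(line);
    }
});
```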