MongoDB – sharded cluster “the collection’s metadata lock is taken” warning occurring very frequently

mongodb, sharding

I am seeing this warning in the mongos log consistently for one collection.

[conn96] warning: splitChunk failed - cmd: { splitChunk: "ibeat20150105.dgrpCount", keyPattern: { articleId: 1, host: 1 }, min: { articleId: MinKey, host: MinKey }, max: { articleId: MaxKey, host: MaxKey }, from: "shard0000", splitKeys: [ { articleId: "01a12144225c646875aeea79990432", host: "xyz.com" } ], shardId: "ibeat20150105.dgrpCount-articleId_MinKeyhost_MinKey", configdb: "x.x.x.x:27017,x.x.x.x:27017,x.x.x.x:27017" } result: { who: { _id: "ibeat20150105.dgrpCount", state: 1, who: "ibeatdb61:27017:1420207410:680478249:conn105:432206596", ts: ObjectId('54aa4f1c253720e5ba254fde'), process: "ibeatdb61:27017:1420207410:680478249", when: new Date(1420447516889), why: "split-{ articleId: MinKey, host: MinKey }" }, ok: 0.0, errmsg: "the collection's metadata lock is taken" }

This is happening continuously for one collection.
Also, when I checked the data distribution, all of the data for this particular collection is on one shard only.
I checked the lock with _id: "balancer" and it seems to be working fine, as I did not see it held for any unusually long period.
I have 5 mongos servers running in my application.

This is creating load on my primary shard, which is impacting write operations.
Could you please help me?

Best Answer

There are two things that take a metadata lock on a collection: the balancer (to move a chunk) and a mongos (to split a chunk). Now, sometimes you will get this warning (it is just that, a warning) if, for example, one mongos attempts to split while another mongos is already performing a split. Or it might be that the balancer is moving chunks, and the split cannot get the lock for that reason.

In either case, the chunk will usually get split eventually and things will settle down, but you have indicated that this is not happening. If the balancer is running constantly, then you may want to re-evaluate your shard key choice and whether you need to pre-split, etc., to alleviate the imbalances you are creating.

In any case, one thing you can do to give the mongos a better chance to split without an error is to temporarily disable the balancer and see if stopping it allows the splits to happen. This is also the first step of the more detailed and involved procedure below for the more extreme case.

That extreme case would involve some sort of issue with stale locks. As mentioned in the comments, this is intended as a break-fix type of intervention to get that particular collection split. It is a lot of work, and it is not really for beginners. The lock that is blocking the splits should eventually expire on its own, but you may not have the time to wait for that to happen. To force the issue, we basically need to do the following:

  • Reduce contention for the meta data lock (balancer, and splits)
  • Connect to a mongos, split the chunk on the collection manually
  • Manually move some chunks for the collection onto other shards (manual balancing)
  • Restore normal balancing and splitting

Note: this will require you to restart mongos processes if you want to be absolutely sure. That will usually mean downtime for your application and some errors getting thrown back to your app when you do the restarts. For this reason, and for general safety, I would recommend doing this at a low-traffic or scheduled maintenance time. For reference in terms of how splitting happens, see this answer.

First, turn off the balancer:

sh.stopBalancer()

The command will wait for any in-progress balancing round to complete before returning, so this may take some time if the balancer is actively moving chunks. To verify the state of the balancer you can use sh.getBalancerState() and sh.isBalancerRunning(). This is also the piece you should try on its own first, as mentioned at the start of the answer.
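For reference, the verification from any mongos shell is just those two helpers:

sh.getBalancerState()     // false means the balancer is disabled
sh.isBalancerRunning()    // false means no balancing round is currently in progress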

Next, restart all of your current mongos processes with the --noAutoSplit option (a sample invocation follows the list below). By doing these two things you have essentially guaranteed that:

  1. Nothing else will be trying to make meta data changes on this cluster (which will allow you to do so manually without error)
  2. If there are any remaining locks blocking you from taking action in the config.locks collection, they are now definitely stale and can be removed if necessary.
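As a rough sketch of that restart, reusing the config server string from your log and leaving out whatever other options you normally start your mongos processes with:

mongos --configdb x.x.x.x:27017,x.x.x.x:27017,x.x.x.x:27017 --port 27017 --noAutoSplit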

Next, you need to manually split the chunk on the collection that has been failing. The easiest helper for this is sh.splitFind(). Since you only have one chunk (MinKey to MaxKey), basically any valid query will work:

sh.splitFind( "ibeat20150105.dgrpCount", { articleId: $SOMEVALUE, host: $SOMEVALUE } )

This will split the chunk in two, and you can repeat the command on the resulting chunks (with different values to target different chunks) to get to a decent number. I would aim for at least as many chunks as you have shards, so that you can move data to all of them; the only real constraint is that the new chunks must be smaller than your max chunk size (64MB by default) in order to be moved successfully. If you know your current approximate data size (from db.collection.stats(), for example), then it should be easy to figure out how many chunks you are going to need.
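For illustration, a couple of repeated splits might look like this; the articleId and host values here are made up and should be replaced with values that actually fall inside the chunk you are targeting:

sh.splitFind( "ibeat20150105.dgrpCount", { articleId: "40000000000000000000000000000000", host: "m.example.com" } )
sh.splitFind( "ibeat20150105.dgrpCount", { articleId: "80000000000000000000000000000000", host: "m.example.com" } )
sh.status()    // confirm the new chunk boundaries for the collection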

If you encounter a similar error regarding locks being taken, and you are sure you have disabled the balancer and splitting on all mongos processes, then you can manually remove the entries from the config.locks collection (because they must be stale) and retry. As long as you run the remove from a mongos and not directly on the config servers, this should be safe enough.
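As a sketch, run from a mongos shell (the _id here is the namespace from your log; inspect before you remove, and only remove once you are certain the balancer and auto-splitting are off everywhere):

use config
db.locks.find({ _id: "ibeat20150105.dgrpCount" }).pretty()    // check the state and holder of the lock first
db.locks.remove({ _id: "ibeat20150105.dgrpCount" })           // remove the stale entry, then retry the split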

Once you have split your chunks to an appropriate size, you can move them off shard0000 with the moveChunk() helper, something like this (to move a chunk to the hypothetical shard0001):

sh.moveChunk("ibeat20150105.dgrpCount", { articleId: $SOMEVALUE, host: $SOMEVALUE }, "shard0001")
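After a few moves you can sanity-check the distribution with the standard helpers, for example:

sh.status()    // chunk counts per shard
db.getSiblingDB("ibeat20150105").dgrpCount.getShardDistribution()    // data and document counts per shard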

Once your chunks have been moved to your satisfaction, you can restart your mongos processes without the --noAutoSplit option, returning them to normal. When that is complete, re-enable normal balancing with sh.startBalancer().