MongoDB – mongos version 3.0.1 Split Chunk Error with Sharding

mongodb, sharding

I have 3 config servers and two shards. Each shard has a primary, a secondary and an arbiter.

This is running with the following mongod settings:

storageEngine=wiredTiger

wiredTigerJournalCompressor=zlib

wiredTigerCollectionBlockCompressor=zlib

The mongos instance produced the log entries below about not being able to acquire a lock during chunk balancing. However, all the data can still be queried fine.

Also, everything seems fine with the sh.status() command:

shards:

{  "_id" : "ib3_prod_rs0",  "host" : "ib3_prod_rs0/ip-10-0-13-119.ec2.internal:27017,ip-10-0-21-50:27017" }

{  "_id" : "ib3_prod_rs1",  "host" : "ib3_prod_rs1/ip-10-0-6-138:27017,ip-10-0-8-202.ec2.internal:27017" }

balancer:

Currently enabled:  yes

Currently running:  yes

    Balancer lock taken at Wed Apr 01 2015 00:00:49 GMT+0000 (UTC) by 54.163.248.60:27017:1427843620:1804289383:Balancer:1681692777

Failed balancer rounds in last 5 attempts:  0

Migration Results for the last 24 hours: 

    No recent migrations

How can I debug what is going on?

2015-03-31T23:41:06.542+0000 I SHARDING [Balancer] moveChunk result: {
ok: 0.0, errmsg: "could not acquire collection lock for
cubes_prod.pushqueues to migrate chunk [{ : MinKey },{ : MaxKey }) ::
caused by :: could not get status from ser…", $gleStats: {
lastOpTime: Timestamp 0|0, electionId:
ObjectId('551b29f564e83f84e725241e') } } 2015-03-31T23:41:06.543+0000
I SHARDING [Balancer] balancer move failed: { ok: 0.0, errmsg: "could
not acquire collection lock for cubes_prod.pushqueues to migrate chunk
[{ : MinKey },{ : MaxKey }) :: caused by :: could not get status from
ser…", $gleStats: { lastOpTime: Timestamp 0|0, electionId:
ObjectId('551b29f564e83f84e725241e') } } from: ib3_prod_rs1 to:
ib3_prod_rs0 chunk: min: { UserKey: MinKey } max: { UserKey: 0 }

Best Answer

Based on what you found out yourself, I think I can provide a more detailed explanation.

How chunk migration works (simplified)

Here is how chunk migration works (simplified for brevity):

  1. When a chunk exceeds the configured chunkSize (64 MB by default), the mongos that caused the size increase will split the chunk.
  2. The chunk is split on the shard the original chunk resided on. Since the config servers do not form a replica set, the split has to update the metadata on all three of them.
  3. If the difference in the number of chunks between the shards reaches the chunk migration threshold, the cluster balancer will initiate the migration process (see the shell sketch after this list for a way to check the chunk counts per shard).
  4. The balancer will try to acquire the balancing lock. Note that because of this there can be only one chunk migration at any given point in time. (This will become important in the explanation of what happened.)
  5. In order to acquire the balancing lock, the balancer needs to update the config metadata on all three config servers.
  6. If the balancing lock is acquired, the data is copied over to the shard with the fewest chunks.
  7. After the data is successfully copied to the destination, the balancer updates the key-range-to-shard mapping on all three config servers.
  8. As a last step, the balancer notifies the source shard to delete the migrated chunk.
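If you want to see how far out of balance a collection actually is, you can count its chunks per shard from the metadata in the config database. The following is a minimal sketch for a mongos shell; the namespace cubes_prod.pushqueues is taken from the log above, so substitute whatever collection you are interested in:

    // Run against a mongos: count chunks per shard for one sharded collection.
    var cfg = db.getSiblingDB("config");
    cfg.chunks.aggregate([
        { $match: { ns: "cubes_prod.pushqueues" } },
        { $group: { _id: "$shard", chunks: { $sum: 1 } } }
    ]);

The balancer only starts migrating once the difference between the shard with the most chunks and the one with the fewest exceeds the migration threshold (2 for fewer than 20 chunks, 4 for 20 to 79 chunks, 8 from 80 chunks onwards).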

What happened in your case

Your data was growing to the point where chunks had to be split. At that point, all three of your config servers were available and the metadata was updated accordingly. Chunk splitting is a relatively cheap operation compared to chunk migration and may happen often, so under load you usually see a lot more chunk splits than migrations. As said before, only one chunk migration can happen at any given point in time.
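You can verify this from the changelog the cluster keeps in the config database. As a rough sketch (run from a mongos shell; the exact event names can vary slightly between versions):

    // Compare how often chunks were split versus actually migrated.
    var cfg = db.getSiblingDB("config");
    cfg.changelog.aggregate([
        { $match: { what: { $in: ["split", "multi-split", "moveChunk.commit"] } } },
        { $group: { _id: "$what", count: { $sum: 1 } } }
    ]);

A lot of split entries with few or no moveChunk.commit entries would match the picture described here.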

Due to whatever problem, one or more of your config servers became unreachable after the chunks were split, but before enough chunks had been migrated to bring the cluster back below the migration threshold (remember, only one chunk migration can run at any given point in time). Bottom line: some time before every chunk that needed to be migrated was actually migrated, one or more of your config servers became unavailable.

Now the balancer wanted to migrate a chunk, but could not reach all config servers to acquire the global migration lock. Hence the error message.
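To see who currently holds the lock and whether the config servers are reachable at all, you can inspect the locks collection and ping each config server directly. A minimal sketch (the config server hostnames are placeholders; use the ones from your mongos --configdb setting):

    // From a mongos shell: show all locks that are currently taken.
    db.getSiblingDB("config").locks.find({ state: { $gt: 0 } }).pretty();

    // Ping one config server directly (repeat for each of the three).
    new Mongo("cfg1.example.net:27019").getDB("admin").runCommand({ ping: 1 });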

How to deal with such a situation

It is very likely that your config servers were out of sync. In order to deal with this situation, please read Adam Comerford's answer to "mongodb config servers not in sync". Follow it to the letter.
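One quick way to check whether the config servers still agree with each other is to hash the config database on each of them and compare the results. This is only a sketch, with placeholder hostnames for the three config servers:

    // Compare the md5 of the "config" database across all config servers.
    // Differing hashes mean the config server metadata is out of sync.
    ["cfg1.example.net:27019",
     "cfg2.example.net:27019",
     "cfg3.example.net:27019"].forEach(function (host) {
        var res = new Mongo(host).getDB("config").runCommand({ dbHash: 1 });
        print(host + " -> " + res.md5);
    });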

How to prevent it

Plain and simple: use MMS. It is free, gives you a lot of information about cluster health and performance, and with the automation agent the administration of a MongoDB cluster can be done a lot faster.

I tend to suggest installing at least 3 monitoring agents, so you can have a scheduled downtime on one while maintaining redundancy with the others.

MMS has alerting capabilities, so you will be notified if one of your config servers becomes unavailable, which is a serious situation.