MongoDB – Mongo large chunks will not split

mongodb, sharding

I had a 3-shard setup and ran out of capacity, so I added 3 more shards (each shard is a replica set). But the data isn't spread across the cluster evenly. I have my chunkSize set to the standard 64 MB:

mongos> db.settings.find( { _id:"chunksize" } )
{ "_id" : "chunksize", "value" : 64 }

I thought this meant that when a chunk hits 64 MB, it splits into two equal chunks, each 32 MB. That's what is demonstrated here. Is that not correct?

Here is my sharding distribution:

mongos> db.accounts.getShardDistribution()
Shard rs_0 at rs_0/mongo_rs_0_member_1:27018,mongo_rs_0_member_2:27019,mongo_rs_0_member_3:27020
 data : 137.62GiB docs : 41991598 chunks : 1882
 estimated data per chunk : 74.88MiB
 estimated docs per chunk : 22312

Shard rs_1 at rs_1/mongo_rs_1_member_1:27018,mongo_rs_1_member_2:27019,mongo_rs_1_member_3:27020
 data : 135.2GiB docs : 41159069 chunks : 1882
 estimated data per chunk : 73.56MiB
 estimated docs per chunk : 21869

Shard rs_2 at rs_2/mongo_rs_2_member_1:27018,mongo_rs_2_member_2:27019,mongo_rs_2_member_3:27020
 data : 219.92GiB docs : 69739096 chunks : 1882
 estimated data per chunk : 119.66MiB
 estimated docs per chunk : 37055

Shard rs_3 at rs_3/mongo_rs_3_member_1:27018,mongo_rs_3_member_2:27019,mongo_rs_3_member_3:27020
 data : 101.52GiB docs : 30650628 chunks : 1882
 estimated data per chunk : 55.23MiB
 estimated docs per chunk : 16286

Shard rs_4 at rs_4/mongo_rs_4_member_1:27018,mongo_rs_4_member_2:27019,mongo_rs_4_member_3:27020
 data : 103.38GiB docs : 31071379 chunks : 1883
 estimated data per chunk : 56.22MiB
 estimated docs per chunk : 16500

Shard rs_5 at rs_5/mongo_rs_5_member_1:27018,mongo_rs_5_member_2:27019,mongo_rs_5_member_3:27020
 data : 101.1GiB docs : 30516395 chunks : 1881
 estimated data per chunk : 55.04MiB
 estimated docs per chunk : 16223

Totals
 data : 798.77GiB docs : 245128165 chunks : 11292
 Shard rs_0 contains 17.23% data, 17.13% docs in cluster, avg obj size on shard : 3KiB
 Shard rs_1 contains 16.92% data, 16.79% docs in cluster, avg obj size on shard : 3KiB
 Shard rs_2 contains 27.53% data, 28.45% docs in cluster, avg obj size on shard : 3KiB
 Shard rs_3 contains 12.7% data, 12.5% docs in cluster, avg obj size on shard : 3KiB
 Shard rs_4 contains 12.94% data, 12.67% docs in cluster, avg obj size on shard : 3KiB
 Shard rs_5 contains 12.65% data, 12.44% docs in cluster, avg obj size on shard : 3KiB

What's up with this? How can the first 3 shards/replica sets have an average chunk size greater than 64 MB when that is the configured chunkSize? rs_2 is at 119 MB! rs_2 has 27.53% of the data when it should have about 16.6%.

My shard key has very high cardinality, and it is not monotonically increasing.

What should I do here? I can manually find chunks that are large and split them, but that is a pain. Do I lower my chunkSize? Is there some service or command I need to run to do this automatically?

Best Answer

There's a lot to go through here, so I'll take it piece by piece. First off, splitting:

I thought this meant that when a chunk hits 64 MB, it splits into two equal chunks, each 32 MB. That's what is demonstrated here. Is that not correct?

That's not quite how it works. If you have a 64 MB chunk and you manually run a splitFind command, you will get (by default) two chunks split at the mid-point. Auto-splitting works differently though - the details are actually quite involved, but use what I explain here as a rule of thumb and you will be close enough.
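For reference, a manual split from the mongos looks something like this (a sketch - the database name and shard key field are assumptions, since they aren't shown in the question):

// Split the chunk containing this shard key value at that chunk's median point.
// "mydb.accounts" and the "accountId" field are placeholders - substitute your
// own namespace and shard key.
sh.splitFind("mydb.accounts", { accountId: "some-value-inside-the-chunk" })

// Or split at an explicit shard key value rather than at the median:
sh.splitAt("mydb.accounts", { accountId: "exact-split-point" })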

Each mongos tracks (approximately) how much data it has seen inserted or updated for each chunk. When it sees that roughly 20% of the maximum chunk size (so 12-13 MiB by default) has been written to a particular chunk, it attempts an automatic split of that chunk. It sends a splitVector command to the primary that owns the chunk, asking it to evaluate the chunk range and return any potential split points. If the primary replies with valid points, the mongos attempts to split on those points. If there are no valid split points, the mongos retries this process when the writes reach 40%, then 60%, of the max chunk size.
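If you want to see what the primary would answer for a given range, you can issue the same command yourself - a sketch, assuming a namespace of mydb.accounts and a shard key of { accountId: 1 } (splitVector is an internal command, so the exact options can vary by version):

// Run against the primary of the shard that owns the chunk (not the mongos).
db.adminCommand({
    splitVector: "mydb.accounts",          // namespace (assumed)
    keyPattern: { accountId: 1 },          // shard key (assumed)
    min: { accountId: MinKey },            // chunk lower bound
    max: { accountId: MaxKey },            // chunk upper bound
    maxChunkSizeBytes: 64 * 1024 * 1024    // 64 MiB
})
// An empty "splitKeys" array in the reply means no valid split point was found
// for that range.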

As you can see, this does not wait for a chunk to reach the max size before splitting - in fact it should happen long before that, and in a normally operating cluster you should not generally see such large chunks.

What's up with this? How can the first 3 shards/replica sets have an average chunk size greater than 64 MB when that is the configured chunkSize? rs_2 is at 119 MB!

The only thing preventing large chunks from occurring is the auto-split functionality described above. Your average chunk sizes suggest that something is preventing the chunks from being split. There are a couple of possible reasons for this, but the most common is that the shard key is not granular enough.

If your chunk ranges get down to a single shard key value then no further splits are possible and you get "jumbo" chunks. I would need to see the ranges to be sure, but you can inspect them easily enough with sh.status(true); for a more easily digestible version, take a look at this Q&A I posted about determining the chunk distribution.
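If you prefer to query the metadata directly rather than wade through sh.status(true), something like this from the mongos will list the ranges (the namespace is an assumption - adjust to your own; note that newer versions key the chunks collection by collection UUID rather than ns):

db.getSiblingDB("config").chunks.find(
    { ns: "mydb.accounts" },
    { min: 1, max: 1, shard: 1, jumbo: 1, _id: 0 }
).sort({ min: 1 })
// A chunk whose range has collapsed to a single shard key value cannot be split
// any further; chunks the balancer has given up moving are flagged "jumbo": true.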

If that is the issue, you really only have two choices: either live with the jumbo chunks (and possibly increase the max chunk size so that they are able to move around - any migration of a chunk over the max size will be aborted and the chunk tagged as "jumbo" by the mongos), or re-shard the data with a more granular shard key that prevents the creation of single-key chunks.
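For reference, the max chunk size is just a document in the config database, so raising it (if you decide to live with the jumbo chunks) is a one-liner from the mongos - a sketch, using 128 MB as an arbitrary example value:

// Value is in MB; this only affects future split/move decisions, it does not
// re-merge or re-split existing chunks.
db.getSiblingDB("config").settings.save({ _id: "chunksize", value: 128 })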

rs_2 has 27.53% of the data when it should have about 16.6%.

This is a fairly common misconception about the balancer - it does not balance based on data size, it just balances the number of chunks (which you can see are nicely distributed). From that perspective, a chunk with 0 documents counts just the same as one with 250k documents. Hence the imbalance in terms of data comes from the imbalance in the chunks themselves - some contain a lot more data than others.
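You can see the metric the balancer actually works from by counting chunks per shard in the config database (namespace assumed, as before):

db.getSiblingDB("config").chunks.aggregate([
    { $match: { ns: "mydb.accounts" } },
    { $group: { _id: "$shard", chunks: { $sum: 1 } } },
    { $sort: { chunks: -1 } }
])
// These counts will be near-equal even when the data sizes per shard are not.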

What should I do here? I can manually find chunks that are large and split them, but that is a pain. Do I lower my chunkSize?

Lowering the chunk size would cause the mongos to check for split points more frequently, but it won't help if the splits are failing (which your chunk size averages suggest is the case) - they will just fail more often. As a first step, I would find the largest chunks (see the Q&A link above) and split those first; a sketch of how you might do that follows.
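This walks the chunk metadata and measures each range with the dataSize command. The namespace, database name, and shard key pattern are assumptions, and dataSize scans each range, so expect it to be slow on a collection this size:

// Collect the chunks over the 64 MiB target so they can be split first.
// Run from a mongos.
var ns = "mydb.accounts";                   // assumed namespace
var keyPattern = { accountId: 1 };          // assumed shard key
var oversized = [];
db.getSiblingDB("config").chunks.find({ ns: ns }).forEach(function (chunk) {
    var res = db.getSiblingDB("mydb").runCommand({
        dataSize: ns,
        keyPattern: keyPattern,
        min: chunk.min,
        max: chunk.max
    });
    if (res.size > 64 * 1024 * 1024) {
        oversized.push(chunk);
        print(tojson(chunk.min) + " -> " + tojson(chunk.max) + " : " +
              (res.size / 1024 / 1024).toFixed(1) + " MiB");
    }
});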

If you are going to do any manual splitting or moving, I recommend turning off the balancer so that it is not holding the metadata lock and does not kick in as soon as you start splitting. It's also generally a good idea to do this at a low-traffic time, because otherwise the auto-splitting I described above could also interfere.
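From the mongos that would look something like this:

sh.stopBalancer()          // disable the balancer and wait for any in-progress migration
sh.getBalancerState()      // should now report false
sh.isBalancerRunning()     // confirm no migration is still running before you start

// ... do the manual splitting / moving ...

sh.startBalancer()         // re-enable when you are done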

After a quick search I don't have anything generic immediately to hand, but I have in the past seen scripts used to automate this process. They tend to need to be customized to fit the particular issue (imagine an imbalance caused by a monotonic shard key versus an issue with chunk data density, for example).
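As a starting point, the split pass itself can build directly on the dataSize loop above - a sketch, assuming "oversized" is the array of chunk documents that loop collected and the same (assumed) namespace:

var ns = "mydb.accounts";
oversized.forEach(function (chunk) {
    // sh.splitFind targets the chunk owning this shard key value and splits it
    // at its median point; it returns an error if no valid split point exists
    // (i.e. the chunk really is a single-key jumbo chunk).
    printjson(sh.splitFind(ns, chunk.min));
});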