Mongodb – Sharding Mongodb while limiting the number of documents to be migrated

mongodbsharding

I have little experience working with databases, but at my work I was tasked to deploy a sharded Mongodb cluster and the constrain on the shard is:

database with multiple collections, 7 collections will live on primary shard, 25 will be sharded evenly and one ( that has over 2.5 billion documents, 2.3Tb) will have 2Tb on primary and 0.3Tb divided withing the other clusters.
I have been reading the doc's from Mongodb website, tutorials on the internet and posted questions on different forums with no luck so far.

I understand that if you don't enable shard on the collection, it will live on only one collection. this will take care on the collections that will stay on the primary server.

Also, adding shard on the collection, will make the collection to shard evenly (or close to it), and that takes care of the 25 collections that can live on any shard.

Now for the last one I was thinking of a compound key using the id field (which is a unique field) together with "in_Use" field which is a boolean to divide the collection as in B=true stays on the primary shard and MOST, but not all, B=false goes tho the other shards.

A second problem with this collection is that the B=true values are changed to false after 24 hours and new documents with the value true are created, so they would needed to be swapped everyday. This solution I couldn't find on any documentation or tutorial, so I am not sure if it is doable.

Thirdly, since we have a size restriction on all shards except the primary one, we will have to restrict the size of the shards, according to the documentation, this can be accomplished by adding "maxSize" option to the addShard command, but i believe that would be set to all collections for the database and it might create a problem if more space are added dynamically. The shard might not increase it's "maxSize" dynamically.

Any ideas or suggestions are welcomed.

Best Answer

  • For your 7 collections on the primary shard:

    Enter as administrator on the primary replica (if you shard your replica) and create your collection there by inserting 1 document or creating an index there. When you create a collection via the mongos - shard client - then the collection is started on a random shard; if you create it on the shard itself first, then you know it's on the one you want. shard1$ mongo --port 27018 localhost/mydb --eval 'db.mycol.insert({firstdocument:'hello'})'

  • For the 25 collections:

    Standard sharding indeed as you mention. Be patient with the distribution. The first shards will start distributing after a certain amount of data, can be 100k documents if the documents are small.

  • For the big collection:

    Have a look at shard-tags. So, you have to tag your collection on shard1 with let's say : {status:'in_use'}; shard2 is tagged {status:'freeBeer'}. Then depending on the status in the document, the shard is chosen.

  • For the disk issue, I think you gonna have to write a script. Get a warning when 60% of a disk is used, and another warning at 80% to take action. Then you'll need to upgrade your disk or redistribute the chunks. Certainly the 2TB collection could give you space via the tag, to send more to shard2. Or, add a shard3 ... what is actually the purpose of sharding, to grow horizontally.

  • Choose shardKeys well, practice configuration and setup on a small server, practice moving chunks, adding tags, ....

  • Have a look at the AllChunkInfo script of Adam Comerford for you scripts and testing : https://github.com/comerford/mongodb-scripts