MongoDB Sharding – Enable Sharding for Database Takes Five to Six Minutes

mongodbmongodb-3.0mongodb-3.2

I'm seeing some odd behavior when invoking the enableSharding command for a database. First a bit of background:

I'm using MongoDB v3.2.7 and the Java driver. Currently I have only one replica set (three nodes) which represents one shard (just one shard for – plans to scale horizontally in the future). This is front ended by three mongos nodes.

The application leveraging this MongoDB "cluster" creates thousands (greater than 10K) of databases. Inside of each database is roughly 15 collections. The databases and collections are created and initialized at runtime by the application as needed. Part of the initialization is to invoke "enableSharding" on the database and then create the collections and finally shard the collections. Normal behavior for the application also includes dropping and creating this databases somewhat frequently.

This implementation is not yet in production but still in development phase. We are trying to mimic the MongoDB architecture that we plan on going to production with. However, it seems about twice a month the primary member of the shard/replica set will begin taking five to six minutes to complete the enableSharding command. The command does complete, but associated application logic "times out". All other operations on the primary node proceed as normal. There seems to be no impact to queries or writes. It also continues to act as the "primary". Again the enableSharding command completes – it just takes five to six minutes.

One other notable item is that we use "listDatbases" as part of our monitoring. I do notice that the call to "listDatabases" also begins to slow down when the enableSharding command for a database begins to take five to six minutes to complete. I guess a few questions would be:

Has anyone else observed this issue? Or maybe something related?
Should our application be initializing databases and collections at runtime so frequently?
Is the greater than 10K databases * 15 collections going to come back and bite us?

Mainly I'm just wondering why enableSharding for a new database would slow down so much. The only way we can clear it up is to restart the replica set member.

Thanks,
Terence

Best Answer

The enableSharding command is basically a meta data operation that changes one document in the databases collection, and that happens on the config servers (at least it was the last time I checked it in detail in 2.6). If this is slow, it's because you have contention there (at the config servers), likely due to activity from the other collections being sharded and balanced.

Each initial collection sharding, chunk migration, balancer lock will hit various tables on the config servers, and potentially contend with your enable sharding command (the meta data operations are special cases too, with 2 phase commit etc.). You can look at scaling up the config databases (will only work if they are struggling with resources based on your current load, not if you are hitting locking contention), or if this is only a once-off and not indicative of ongoing load, then checking for completion of other operations, balanced collections etc. will make your initial configuration run slower but less prone to timeout and error. The other option would be to alter the timeout in the code to wait longer for the command to complete, of course.