MongoDB – Querying Sharded and Unsharded Collections/Databases

mongodbsharding

I am using mongodb (3.4). I have six databases. I am in the process of sharding a particularly large collection (call it 'user_posts') in one of them. I like the idea of sharding all collections in all six databases, because it improves redundancy and allows for horizontal scaling. But because I am currently terrified of shard keys, I do not want to rush this process. So, I want to shard the collection that I must, and deal with the other collections later.

So, I will have two mongo servers, where one is just a shard of the 'user_posts' collection, and the other is a shard of the 'user_posts' collection but also contains some standalone databases.

Will mongos be able to query all six databases in this scenario?

Thanks

Best Answer

A MongoDB sharded cluster deployment can contain collections that are either unsharded (the default when created) or sharded (based on your chosen shard key). Sharding a collection is (as at MongoDB 3.4) a two-step process: enable sharding on a database using sh.enableSharding() and then shard specific collections using sh.shardCollection(). Sharded and unsharded collections coexist in the same sharded cluster and your application generally does not have to be aware of whether collections are sharded.

If you have a sharded cluster deployment (irrespective of whether you have enabled sharding on individual collections) you should always make your queries and updates through a mongos instance. The mongos instances handle details of routing queries to the correct shard(s) based on the current configuration and status of backing mongod servers.

The "generally unaware" application design caveat alluded to earlier is that a few operations (for example: $lookup and $graphLookup) may not support using sharded collections as the source or destination. There are also some operational restrictions in a sharded cluster which are worth reviewing.

I like the idea of sharding all collections in all sixdatabases, because it improves redundancy and allows for horizontal scaling

Sharding alone does not improve redundancy. Sharding is used to partition data across multiple servers so you can have multiple MongoDB instances presented as a single logical deployment. Ideally your shards should be backed by replica sets so you have data redundancy and failover (which both rely on replication). I would strongly recommend upgrading a production deployment to a replica set before moving to a sharded cluster.

If you plan on eventually sharding it would be reasonable to get your sharded cluster infrastructure in place so you can minimise disruptions in future. For example, moving from a replica set to a sharded cluster deployment involves setting up config servers, mongos instances, and changing your connection seed list from replica set members to mongos instances.

because I am currently terrified of shard keys, I do not want to rush this process. So, I want to shard the collection that I must, and deal with the other collections later.

Is the collection that you must shard stressing your current deployment? Before moving to a larger deployment with more moving parts I would encourage you to try to understand and tune your existing deployment. There may be some easy wins that could improve performance ... or conversely, unfortunate design choices that won't scale well in a sharded environment. Helpful starting points include the MongoDB Production Notes, Operations Checklist, and Security Checklist.

If you have good understanding or estimation of your data model and distribution patterns, I would suggest sharding in a test environment using representative data. There are many helpful tools for generating fake (but probabilistic) data if needed. For an example recipe, see: Stack Overflow: duplicate a collection into itself.

So, I will have two mongo servers, where one is just a shard of the 'user_posts' collection, and the other is a shard of the 'user_posts' collection but also contains some standalone databases.

Assuming you are describing a sharded cluster with two shards, that is correct. Sharded collections will be distributed across shards while unsharded collections will live on a primary shard for each database.

Will mongos be able to query all six databases in this scenario?

mongos provides access to all databases in the same sharded cluster.