MongoDB Sharding – Using Compound Shard Key with _id Field

mongodb, mongodb-3.2, sharding

I have documents like:

{_id: "someid1", "bar": "somevaluebar1"}
{_id: "someid2", "foo": "somevaluefoo2", "bar": "somevaluebar2"}
{_id: "someid3", "foo": "somevaluefoo3", "zoo": "somevaluezoo3"}
{_id: "someid4", "zoo": "somevaluezoo4"}

We query documents most often by "foo" and second most often by "bar". Does it make sense to create a compound shard key like { "foo" : 1, "bar" : 1, "_id" : 1 }, with "_id" included because "foo" and "bar" might be missing?

When I tried to run this command

sh.shardCollection("<your-db>", {{ "foo" : 1, "bar" : 1, "_id" : 1 }:"hashed"})

it gave me a syntax error.

Best Answer

You'll need to rethink your shard key approach.

As at MongoDB 3.2:

All fields in a compound shard key must be present in all documents and will be immutable (i.e. the shard key for an existing document cannot be changed).
A hashed shard key is based on a single field, and does not support range queries.
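
To see how the first constraint applies to the sample documents in the question, here is a rough sketch in plain JavaScript (it runs in the mongo shell or in Node). The helper name missingShardKeyFields is made up for illustration, and it only looks at top-level fields:

```javascript
// Sample documents from the question.
var docs = [
  { _id: "someid1", bar: "somevaluebar1" },
  { _id: "someid2", foo: "somevaluefoo2", bar: "somevaluebar2" },
  { _id: "someid3", foo: "somevaluefoo3", zoo: "somevaluezoo3" },
  { _id: "someid4", zoo: "somevaluezoo4" }
];

// Hypothetical helper: list the shard key fields a document does not have.
function missingShardKeyFields(doc, shardKey) {
  return Object.keys(shardKey).filter(function (field) {
    return !Object.prototype.hasOwnProperty.call(doc, field);
  });
}

// The compound key proposed in the question.
var proposedKey = { foo: 1, bar: 1, _id: 1 };

var report = docs.map(function (doc) {
  return { _id: doc._id, missing: missingShardKeyFields(doc, proposedKey) };
});
```

Only someid2 ends up with an empty missing list; the other three documents lack foo, bar, or both, which is why the proposed key does not fit this data.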

It generally makes sense to have a shard key that supports your common queries so they can be targeted at a subset of shards with relevant data, but this doesn't appear to be possible in your case as both foo and bar are optional fields.
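
As a rough illustration of what "targeted" means for a ranged compound shard key, the sketch below models the rule that a query can be routed to a subset of shards when it supplies the shard key or a leading prefix of it. This is a deliberately simplified model, not how mongos is implemented: the function names are hypothetical, it only considers top-level equality fields, and the key { foo: 1, bar: 1, _id: 1 } is used purely as an example (it is not usable here, since foo and bar are optional).

```javascript
// Return the leading shard key fields that the query supplies.
function usableShardKeyPrefix(query, shardKey) {
  var prefix = [];
  var fields = Object.keys(shardKey);
  for (var i = 0; i < fields.length; i++) {
    // Stop at the first shard key field the query does not include.
    if (!Object.prototype.hasOwnProperty.call(query, fields[i])) break;
    prefix.push(fields[i]);
  }
  return prefix;
}

// Simplified routing decision: no usable prefix means every shard is asked.
function routing(query, shardKey) {
  return usableShardKeyPrefix(query, shardKey).length > 0
    ? "targeted"
    : "scatter-gather";
}

var exampleKey = { foo: 1, bar: 1, _id: 1 };
var byFoo = routing({ foo: "somevaluefoo2" }, exampleKey);
var byBar = routing({ bar: "somevaluebar1" }, exampleKey);
```

In this model a query on foo alone can be targeted, while a query on bar alone goes to every shard because bar is not a leading field of the key.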

If your _id field provides good cardinality (i.e. a large number of distinct values) but is monotonically increasing (e.g. default ObjectIds), you could consider a hashed shard key on the _id field for good write distribution. The hashed index wouldn't support your common read queries (unless they specify exact _id values), so you would need a secondary index for your queries on foo and bar (i.e. {foo:1, bar:1}). The recommended secondary index(es) and field order will depend on your common queries and sort order.
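
A minimal sketch of that setup, assuming a database called mydb and a collection called mycollection (both names are placeholders, as is the helper hashedIdPlan). It only builds the command documents, so it can be inspected before anything is run:

```javascript
// Hypothetical helper: command documents for a hashed _id shard key plus a
// secondary index on { foo: 1, bar: 1 }.
function hashedIdPlan(dbName, collName) {
  var namespace = dbName + "." + collName;
  return {
    enableSharding: { enableSharding: dbName },
    shardCollection: { shardCollection: namespace, key: { _id: "hashed" } },
    createIndexes: {
      createIndexes: collName,
      indexes: [{ key: { foo: 1, bar: 1 }, name: "foo_1_bar_1" }]
    }
  };
}

var plan = hashedIdPlan("mydb", "mycollection");

// From a mongo shell connected to a mongos, these would be run as:
//   db.adminCommand(plan.enableSharding);
//   db.adminCommand(plan.shardCollection);
//   db.getSiblingDB("mydb").runCommand(plan.createIndexes);
```

The shell helpers do the same thing: sh.enableSharding("mydb"), then sh.shardCollection("mydb.mycollection", { _id : "hashed" }), then db.mycollection.createIndex({ foo : 1, bar : 1 }). Note that the first argument to sh.shardCollection is the full namespace and the second is a single key document, which is where the command in the question went wrong.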

For further background information I suggest reviewing:

Related Solutions

MongoDB – Presplitting Chunks for Compound Shard Key

There is just a slight issue with how you are passing in the $minKey values. Try this instead:

db.adminCommand( { split : "mydb.mycollection" , middle : { "region" : "region1", "foo" : MinKey , "bar" : MinKey } } );
db.adminCommand( { split : "mydb.mycollection" , middle : { "region" : "region2", "foo" : MinKey , "bar" : MinKey } } );

This got me the following layout:

sh.status()
--- Sharding Status --- 
  sharding version: {
    "_id" : 1,
    "version" : 4,
    "minCompatibleVersion" : 4,
    "currentVersion" : 5,
    "clusterId" : ObjectId("53a2cd9d98b4ace818666544")
}
  shards:
    {  "_id" : "shard0000",  "host" : "localhost:30000" }
    {  "_id" : "shard0001",  "host" : "localhost:30001" }
    {  "_id" : "shard0002",  "host" : "localhost:30002" }
  databases:
    {  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
    {  "_id" : "mydb",  "partitioned" : true,  "primary" : "shard0001" }
        mydb.mycollection
            shard key: { "region" : 1, "foo" : 1, "bar" : 1 }
            chunks:
                shard0000   1
                shard0001   2
            {
    "region" : { "$minKey" : 1 },
    "foo" : { "$minKey" : 1 },
    "bar" : { "$minKey" : 1 }
} -->> {
    "region" : "region1",
    "foo" : { "$minKey" : 1 },
    "bar" : { "$minKey" : 1 }
} on : shard0000 Timestamp(2, 0) 
            {
    "region" : "region1",
    "foo" : { "$minKey" : 1 },
    "bar" : { "$minKey" : 1 }
} -->> {
    "region" : "region2",
    "foo" : { "$minKey" : 1 },
    "bar" : { "$minKey" : 1 }
} on : shard0001 Timestamp(2, 2) 
            {
    "region" : "region2",
    "foo" : { "$minKey" : 1 },
    "bar" : { "$minKey" : 1 }
} -->> {
    "region" : { "$maxKey" : 1 },
    "foo" : { "$maxKey" : 1 },
    "bar" : { "$maxKey" : 1 }
} on : shard0001 Timestamp(2, 3)

The use of the $minKey (MinKey) and $maxKey (MaxKey) values is a bit tough to tease out (they are rarely used except internally), but there is a decent and illustrative example in the docs.
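
If there are more than a couple of regions, the split commands above could be generated rather than typed out. This is only a sketch: buildRegionSplits is a made-up helper, and the minimum-key value is passed in as a parameter so that the shell's own MinKey can be supplied unchanged.

```javascript
// Hypothetical helper: one split command per region boundary, with foo and
// bar pinned to the minimum key so each region starts its own chunk.
function buildRegionSplits(namespace, regions, minKey) {
  return regions.map(function (region) {
    return {
      split: namespace,
      middle: { region: region, foo: minKey, bar: minKey }
    };
  });
}

// In the mongo shell MinKey is predefined; elsewhere a placeholder string
// stands in so the generated documents can still be inspected.
var minKey = (typeof MinKey !== "undefined") ? MinKey : "MinKey placeholder";

var commands = buildRegionSplits(
  "mydb.mycollection",
  ["region1", "region2"],
  minKey
);

// In the shell, connected to a mongos, each command would be run with:
//   commands.forEach(function (cmd) { printjson(db.adminCommand(cmd)); });
```

Each split point adds one chunk boundary, so two regions give the three chunks shown in the sh.status() output above.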

Mongodb – Choosing shard key with compound index field

You should be able to use the index listed to cover the shard key. It is a superset of your shard key fields.

The shard key listed should be fine for distributing write load, given you don't expect any individual articleId/host pairs to take the bulk of your writes at a given point in time.

I would be concerned about this shard key for reads. In order to target a single shard for a query, you need to include the shard key values. My guess is that your queries do not include timestamp. Without timestamp, your queries will be sent to every shard, which is inefficient. With scatter-gather reads, you hamper your ability to scale reads by adding shards.
