MongoDB – How to Choose a Randomly Distributed Shard Key

mongodb sharding

In some reading I came across the following statement about choosing randomly distributed shard keys, but I can't understand why it is true. Could someone explain it in detail?

"The only downside to randomly distributed shard keys is that MongoDB isn’t efficient at randomly accessing data beyond the size of RAM."

Thank you.

Best Answer

While it's hard to say for certain without full context, I assume it's referring to the need to keep the working set in memory.

A randomly distributed shard key spreads the workload across the entire index, so the entire index needs to fit in memory to handle the workload efficiently. Performance deteriorates once the index on a shard grows larger than RAM, because operations on the shard key index then have to page-fault data in and out of memory.

In contrast, a non-random shard key may have a "hot" subset that handles most of the working set. For example, consider a website where only newer "posts" by users are frequently accessed and older "posts" are rarely accessed. While the indexes on "posts" may be larger than available memory, only subsets of the indexes may need to fit in memory, reducing memory pressure and the potential of page faults.
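The contrast can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a model of MongoDB's actual storage engine: memory is treated as a plain LRU cache of index pages, and the page counts, the 95%/5% skew, and names such as `LruCache` and `hitRate` are all made up for illustration.

```javascript
// Minimal LRU page cache built on Map, which iterates in insertion order.
class LruCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.pages = new Map();
  }

  // Returns true on a hit, false on a miss (a simulated page fault).
  access(page) {
    const hit = this.pages.has(page);
    if (hit) {
      this.pages.delete(page); // re-inserted below as most recently used
    } else if (this.pages.size >= this.capacity) {
      // Evict the least recently used page (the oldest Map entry).
      this.pages.delete(this.pages.keys().next().value);
    }
    this.pages.set(page, true);
    return hit;
  }
}

// Small deterministic PRNG (mulberry32-style) so runs are repeatable.
function makeRng(seed) {
  let state = seed >>> 0;
  return function () {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Warm the cache, then measure the fraction of accesses served from it.
function hitRate(pickPage, cachePages, accesses) {
  const cache = new LruCache(cachePages);
  for (let i = 0; i < accesses; i++) cache.access(pickPage());
  let hits = 0;
  for (let i = 0; i < accesses; i++) {
    if (cache.access(pickPage())) hits++;
  }
  return hits / accesses;
}

// All sizes below are assumptions chosen for illustration.
const TOTAL_PAGES = 10000; // index pages on one shard
const CACHE_PAGES = 1000;  // index pages that fit in RAM
const HOT_PAGES = 500;     // "newest" pages that receive most traffic
const ACCESSES = 50000;

const rng = makeRng(42);

// Random shard key: every index page is equally likely to be touched.
const uniform = () => Math.floor(rng() * TOTAL_PAGES);

// Recency-skewed key: 95% of accesses go to the newest HOT_PAGES pages.
const skewed = () =>
  rng() < 0.95
    ? TOTAL_PAGES - 1 - Math.floor(rng() * HOT_PAGES)
    : Math.floor(rng() * TOTAL_PAGES);

const uniformHitRate = hitRate(uniform, CACHE_PAGES, ACCESSES);
const skewedHitRate = hitRate(skewed, CACHE_PAGES, ACCESSES);

console.log("random key hit rate:", uniformHitRate.toFixed(3));
console.log("skewed key hit rate:", skewedHitRate.toFixed(3));
```

With a cache that holds a tenth of the index, the uniform pattern can only hit about as often as the cache-to-index ratio allows, while the skewed pattern keeps its hot subset resident and misses far less often.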

Related Solutions

Mongodb – Choosing shard key and friendly URL Ids for the MongoDB

Before you go further, you need to answer a few questions:

How do you represent files within folders in the database?
How do you represent folders?
Do you have relations between folders (parent -> child)?
How often do you expect to create folders and files?
How often do you update existing files in folders, and how many files do you update?

Based on your answers, you can choose a write-optimized schema or a read-optimized schema. A write-optimized schema contains many very small entries, or lets you use built-in operators like $inc over a collection. A read-optimized schema is generally a larger collection like the one you described. In your scenario you could easily have something like this (assuming all folders are at the same level; the "folders" field name is only illustrative):

{ "userid" : "email or id",
  "folders" : [
     { "folder1" : [ "file1", "file2" ] },
     { "folder2" : [ "file3", "file4" ] }
  ]
}

With this schema, though, things get quite complicated if you need to link a folder to a parent folder. Either way, it is clear that userid should be the shard key.

MongoDB – Presplitting Chunks for Compound Shard Key

There is just a slight issue with how you are passing the $minKey values in. Try this instead:

db.adminCommand( { split : "mydb.mycollection" , middle : { "region" : "region1", "foo" : MinKey , "bar" : MinKey } } );
db.adminCommand( { split : "mydb.mycollection" , middle : { "region" : "region2", "foo" : MinKey , "bar" : MinKey } } );

This got me the following layout:

sh.status()
--- Sharding Status --- 
  sharding version: {
    "_id" : 1,
    "version" : 4,
    "minCompatibleVersion" : 4,
    "currentVersion" : 5,
    "clusterId" : ObjectId("53a2cd9d98b4ace818666544")
}
  shards:
    {  "_id" : "shard0000",  "host" : "localhost:30000" }
    {  "_id" : "shard0001",  "host" : "localhost:30001" }
    {  "_id" : "shard0002",  "host" : "localhost:30002" }
  databases:
    {  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
    {  "_id" : "mydb",  "partitioned" : true,  "primary" : "shard0001" }
        mydb.mycollection
            shard key: { "region" : 1, "foo" : 1, "bar" : 1 }
            chunks:
                shard0000   1
                shard0001   2
            {
    "region" : { "$minKey" : 1 },
    "foo" : { "$minKey" : 1 },
    "bar" : { "$minKey" : 1 }
} -->> {
    "region" : "region1",
    "foo" : { "$minKey" : 1 },
    "bar" : { "$minKey" : 1 }
} on : shard0000 Timestamp(2, 0) 
            {
    "region" : "region1",
    "foo" : { "$minKey" : 1 },
    "bar" : { "$minKey" : 1 }
} -->> {
    "region" : "region2",
    "foo" : { "$minKey" : 1 },
    "bar" : { "$minKey" : 1 }
} on : shard0001 Timestamp(2, 2) 
            {
    "region" : "region2",
    "foo" : { "$minKey" : 1 },
    "bar" : { "$minKey" : 1 }
} -->> {
    "region" : { "$maxKey" : 1 },
    "foo" : { "$maxKey" : 1 },
    "bar" : { "$maxKey" : 1 }
} on : shard0001 Timestamp(2, 3)

The use of the $minKey (MinKey) and $maxKey (MaxKey) values is a bit tough to tease out (they are rarely used except internally), but there is a decent, illustrative example in the docs.
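To see why MinKey works as a split point, it can help to model how a compound key is compared field by field against the chunk bounds in the layout above. The following is a rough sketch in plain JavaScript, not MongoDB's routing code: `MIN_KEY` and `MAX_KEY` are hypothetical stand-ins for the BSON MinKey and MaxKey values, the sample `foo` and `bar` values are invented, and the comparison assumes the two values in each field have the same type (real BSON ordering also defines an order across types).

```javascript
// Stand-ins for BSON MinKey / MaxKey: sort below / above every other value.
const MIN_KEY = Symbol("MinKey");
const MAX_KEY = Symbol("MaxKey");

// Compare two field values, treating the sentinels as -infinity / +infinity.
function compareValues(a, b) {
  if (a === b) return 0;
  if (a === MIN_KEY || b === MAX_KEY) return -1;
  if (a === MAX_KEY || b === MIN_KEY) return 1;
  return a < b ? -1 : a > b ? 1 : 0;
}

// Compound shard keys compare field by field, in shard key order.
const FIELDS = ["region", "foo", "bar"];
function compareKeys(a, b) {
  for (const field of FIELDS) {
    const c = compareValues(a[field], b[field]);
    if (c !== 0) return c;
  }
  return 0;
}

// Helper that builds a bound such as { region: "region1", foo: MinKey, bar: MinKey }.
const bound = (region, rest) => ({ region, foo: rest, bar: rest });

// The three chunks from the sh.status() output above.
const chunks = [
  { min: bound(MIN_KEY, MIN_KEY), max: bound("region1", MIN_KEY), shard: "shard0000" },
  { min: bound("region1", MIN_KEY), max: bound("region2", MIN_KEY), shard: "shard0001" },
  { min: bound("region2", MIN_KEY), max: bound(MAX_KEY, MAX_KEY), shard: "shard0001" },
];

// A chunk owns keys in the range [min, max): min is inclusive, max exclusive.
function findChunkIndex(key) {
  return chunks.findIndex(
    (c) => compareKeys(c.min, key) <= 0 && compareKeys(key, c.max) < 0
  );
}

// Hypothetical documents; only the shard key fields matter here.
const inRegion1 = findChunkIndex({ region: "region1", foo: 1, bar: 1 });
const inRegion2 = findChunkIndex({ region: "region2", foo: 7, bar: 3 });
const beforeRegion1 = findChunkIndex({ region: "region0", foo: 5, bar: 5 });

console.log(inRegion1, chunks[inRegion1].shard);
console.log(inRegion2, chunks[inRegion2].shard);
console.log(beforeRegion1, chunks[beforeRegion1].shard);
```

Because MinKey sorts below every real value of foo and bar, a split at { region1, MinKey, MinKey } sits just before the first possible document for region1, so every document for that region lands in the chunk that starts there.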
