Before you go further, you need to answer a few questions:
- How do you represent files within folders in the database?
- How do you represent folders?
- Do you have relations between folders (parent -> child)?
- How often do you expect to create folders and files?
- How often do you update existing files in folders, and how many files do you update at a time?
Based on your answers, you can have a write-optimized schema or a read-optimized schema. A write-optimized schema contains many very small entries, or is one where you can use built-in operators like $inc on a collection. A read-optimized schema generally has larger entries, like the one you described. In your scenario you could very easily have something like this (assuming all folders are at the same level):
{ "userid" : "email or id",
  "folders" : [
    { "folder1" : [ "file1", "file2" ] },
    { "folder2" : [ "file3", "file4" ] }
  ]
}
With this schema, however, it gets quite complicated if you need to link a folder to a parent folder. Either way, it is obvious that the userid is the shard key.
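If folders do need parent -> child links, one alternative is the write-optimized shape: one small document per folder, each holding a reference to its parent. The following is a minimal sketch in plain JavaScript with no database involved. The field names `_id`, `name`, `parentId`, and `files`, the sample values, and the helper `folderPath` are assumptions made for illustration and are not part of the schema above.

```javascript
// Hypothetical write-optimized layout: one small document per folder.
// Every document carries userid, so userid can still act as the shard key.
const folders = [
  { _id: 1, userid: "alice@example.com", name: "root",    parentId: null, files: [] },
  { _id: 2, userid: "alice@example.com", name: "photos",  parentId: 1,    files: ["file1", "file2"] },
  { _id: 3, userid: "alice@example.com", name: "holiday", parentId: 2,    files: ["file3", "file4"] },
];

// Walk the parentId links upward to build the full path of a folder.
function folderPath(docs, folderId) {
  const byId = new Map(docs.map((d) => [d._id, d]));
  const parts = [];
  let current = byId.get(folderId);
  while (current) {
    parts.unshift(current.name);
    current = current.parentId === null ? undefined : byId.get(current.parentId);
  }
  return "/" + parts.join("/");
}

console.log(folderPath(folders, 3)); // "/root/photos/holiday"
```

Adding a file then touches only one small document, at the cost of extra lookups when you need to read a whole folder tree.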
There is just a slight issue with how you are passing the $minKey values in. Try this instead:
db.adminCommand( { split : "mydb.mycollection" , middle : { "region" : "region1", "foo" : MinKey , "bar" : MinKey } } );
db.adminCommand( { split : "mydb.mycollection" , middle : { "region" : "region2", "foo" : MinKey , "bar" : MinKey } } );
This got me the following layout:
sh.status()
--- Sharding Status ---
sharding version: {
"_id" : 1,
"version" : 4,
"minCompatibleVersion" : 4,
"currentVersion" : 5,
"clusterId" : ObjectId("53a2cd9d98b4ace818666544")
}
shards:
{ "_id" : "shard0000", "host" : "localhost:30000" }
{ "_id" : "shard0001", "host" : "localhost:30001" }
{ "_id" : "shard0002", "host" : "localhost:30002" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "mydb", "partitioned" : true, "primary" : "shard0001" }
mydb.mycollection
shard key: { "region" : 1, "foo" : 1, "bar" : 1 }
chunks:
shard0000 1
shard0001 2
{
"region" : { "$minKey" : 1 },
"foo" : { "$minKey" : 1 },
"bar" : { "$minKey" : 1 }
} -->> {
"region" : "region1",
"foo" : { "$minKey" : 1 },
"bar" : { "$minKey" : 1 }
} on : shard0000 Timestamp(2, 0)
{
"region" : "region1",
"foo" : { "$minKey" : 1 },
"bar" : { "$minKey" : 1 }
} -->> {
"region" : "region2",
"foo" : { "$minKey" : 1 },
"bar" : { "$minKey" : 1 }
} on : shard0001 Timestamp(2, 2)
{
"region" : "region2",
"foo" : { "$minKey" : 1 },
"bar" : { "$minKey" : 1 }
} -->> {
"region" : { "$maxKey" : 1 },
"foo" : { "$maxKey" : 1 },
"bar" : { "$maxKey" : 1 }
} on : shard0001 Timestamp(2, 3)
The use of the $minKey (MinKey) and $maxKey (MaxKey) values is a bit tough to tease out (they are rarely used except internally), but there is a decent, illustrative example in the docs.
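To make the role of MinKey and MaxKey in the chunk boundaries more concrete, here is a rough sketch in plain JavaScript that mimics how a compound key is matched against the three chunk ranges from the sh.status() output. It is not MongoDB code: `MIN_KEY`, `MAX_KEY`, `compareValues`, `compareKeys`, and `findChunk` are hypothetical stand-ins, and the comparison only handles values of the same type rather than MongoDB's full BSON ordering.

```javascript
// Stand-ins for MongoDB's MinKey / MaxKey (hypothetical, plain JavaScript only).
const MIN_KEY = Symbol("MinKey");
const MAX_KEY = Symbol("MaxKey");

// MIN_KEY sorts below every other value, MAX_KEY above every other value.
function compareValues(a, b) {
  if (a === b) return 0;
  if (a === MIN_KEY || b === MAX_KEY) return -1;
  if (a === MAX_KEY || b === MIN_KEY) return 1;
  return a < b ? -1 : a > b ? 1 : 0;
}

// Compare compound keys field by field, in shard key order.
const FIELDS = ["region", "foo", "bar"];
function compareKeys(a, b) {
  for (const f of FIELDS) {
    const c = compareValues(a[f], b[f]);
    if (c !== 0) return c;
  }
  return 0;
}

// Chunk ranges as listed by sh.status(): lower bound inclusive, upper bound exclusive.
const allMin = { region: MIN_KEY, foo: MIN_KEY, bar: MIN_KEY };
const allMax = { region: MAX_KEY, foo: MAX_KEY, bar: MAX_KEY };
const split1 = { region: "region1", foo: MIN_KEY, bar: MIN_KEY };
const split2 = { region: "region2", foo: MIN_KEY, bar: MIN_KEY };
const chunks = [
  { min: allMin, max: split1, shard: "shard0000" },
  { min: split1, max: split2, shard: "shard0001" },
  { min: split2, max: allMax, shard: "shard0001" },
];

// Find the chunk whose range contains the given key.
function findChunk(key) {
  return chunks.find(
    (c) => compareKeys(c.min, key) <= 0 && compareKeys(key, c.max) < 0
  );
}

console.log(findChunk({ region: "region1", foo: "a", bar: 1 }).shard); // "shard0001"
```

Because MinKey sorts below any real value, a split point of { region1, MinKey, MinKey } sits just before every document whose region is "region1", which is why each region lands in its own chunk.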
Best Answer
While it's hard to say for certain without full context, I'm assuming it's referring to the need to keep the working set in memory.
A randomly distributed shard key would distribute the workload across the entire index, meaning that the entire index would need to fit in memory to handle the workload efficiently. Performance would deteriorate once the size of this index on a shard grows larger than RAM, as accesses to the shard key index would keep page faulting data in and out of memory.
In contrast, a non-random shard key may have a "hot" subset that handles most of the working set. For example, consider a website where only newer "posts" by users are frequently accessed and older "posts" are rarely accessed. While the indexes on "posts" may be larger than available memory, only subsets of the indexes may need to fit in memory, reducing memory pressure and the potential for page faults.
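The difference can be illustrated with a toy simulation. The sketch below replays index-page accesses against a fixed-size LRU cache standing in for RAM, once with uniformly random keys and once with a "hot" range that receives most of the traffic. All sizes and the 90% hot ratio are assumptions chosen for illustration, the helpers `makeRng` and `hitRate` are hypothetical, and a real storage engine's caching is considerably more involved.

```javascript
// Tiny deterministic PRNG (linear congruential generator) so runs are repeatable.
function makeRng(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Replay page accesses against an LRU cache of fixed size (a stand-in for RAM)
// and return the fraction of accesses that were served from the cache.
function hitRate(nextPage, cacheSize, accesses) {
  const cache = new Map(); // a Map iterates in insertion order, oldest first
  let hits = 0;
  for (let i = 0; i < accesses; i++) {
    const page = nextPage();
    if (cache.has(page)) {
      hits++;
      cache.delete(page); // re-inserted below as most recently used
    } else if (cache.size >= cacheSize) {
      cache.delete(cache.keys().next().value); // evict least recently used
    }
    cache.set(page, true);
  }
  return hits / accesses;
}

const TOTAL_PAGES = 10000; // assumed index size, in pages
const CACHE_PAGES = 1000;  // assumed RAM, in pages (10% of the index)
const HOT_PAGES = 500;     // assumed "recent posts" part of the index

// Random shard key: every index page is equally likely to be touched.
const rngA = makeRng(1);
const randomKey = () => Math.floor(rngA() * TOTAL_PAGES);

// Non-random key: 90% of accesses hit the hot range, the rest hit any page.
const rngB = makeRng(1);
const hotKey = () =>
  rngB() < 0.9
    ? TOTAL_PAGES - 1 - Math.floor(rngB() * HOT_PAGES)
    : Math.floor(rngB() * TOTAL_PAGES);

const randomRate = hitRate(randomKey, CACHE_PAGES, 100000);
const hotRate = hitRate(hotKey, CACHE_PAGES, 100000);
console.log({ randomRate, hotRate });
```

With these assumed numbers the hot range fits in the cache, so the skewed workload should see a far higher hit rate than the uniform one, whose hit rate stays close to the cache-to-index size ratio.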