MongoDB as a log storage. Choosing shard key

logging, mongodb, multi-tenant, sharding

I'm designing a log storage system based on MongoDB.
I want to shard the log collection to increase ingestion rate and capacity (distributing writes across several machines) while still allowing fast searches.
I should be able to increase ingestion by adding more nodes to the cluster.

My collection has the following fields:

Subsystem – string, name of the application. E.g: "SystemA", "SystemB". ~ 100 unique values.

Tenant – string, the name of the deployment. It's used to separate logs from different application deployments / environments. E.g: "South TEST", "North DEV", "South PROD", "North PROD". ~ 20 unique values.

Date – timestamp.

User – string.

SessionId – guid, logically groups several related log records.

Data – BLOB, contains zipped data. Average size = 2 KB, maximum = 8 MB.

Context – array of key/value pairs. Both key and value are strings. It's used to store additional metadata associated with the event.

A search can use any combination of the fields Subsystem, Date, User and Context.
Tenant will almost always be specified.

The question is – which shard key and sharding strategy would work best in this case?

My suggestions:

The simplest case is to shard by Tenant, but that will cause highly uneven data distribution, because PROD environments generate much more logs than DEV.

"Tenant + Subsystem" seems better, but there are still subsystems that generate much more logs than others.
Also, Subsystem is not mandatory – a user can omit it during a search, and the query will then be broadcast to all shards.

"SessionId" will give an even data distribution, but search requests will be broadcast to all nodes.

Best Answer

For even write distribution, SessionId looks like a much better idea (or a hashed index on other fields). Proper indexing on each shard should then take care of the queries.
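As a rough sketch of what that could look like in the shell: the database name "logs", the collection name "events", the choice of indexes, and the { k, v } shape of each Context entry are all assumptions for illustration, not something taken from the question.

```javascript
// Sketch only: "logs", "events" and the { k, v } layout of Context entries
// are assumed names, not part of the original question.
function setUpLogSharding(sh, logsDb) {
  // Hashed shard key on SessionId: spreads inserts evenly across shards.
  sh.enableSharding("logs");
  sh.shardCollection("logs.events", { SessionId: "hashed" });

  // Secondary indexes so that each shard can answer its part of a
  // broadcast (scatter-gather) query from an index, not a collection scan.
  const events = logsDb.getCollection("events");
  events.createIndex({ Tenant: 1, Date: -1 });
  events.createIndex({ Tenant: 1, Subsystem: 1, Date: -1 });
  events.createIndex({ Tenant: 1, User: 1, Date: -1 });
  events.createIndex({ Tenant: 1, "Context.k": 1, "Context.v": 1 });
}
```

Connected to a mongos, this would be called as setUpLogSharding(sh, db.getSiblingDB("logs")). Tenant leads every index because the question says it is almost always specified.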

The other two options have very low cardinality (and therefore very large chunks).
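To see why low cardinality hurts, here is a toy simulation that runs under Node. It is not MongoDB's balancer: it only models the fact that all documents with the same shard-key value must live in the same chunk, and so on the same shard. The workload (20 tenants, two of them producing 80% of the traffic, five records per session) is made up for illustration.

```javascript
const crypto = require("crypto");

// Toy model, not MongoDB's balancer: a chunk cannot be split below a single
// shard-key value, so every document with the same key lands on one shard.
function shardLoads(docs, keyOf, shardCount) {
  const loads = new Array(shardCount).fill(0);
  for (const doc of docs) {
    const digest = crypto.createHash("md5").update(String(keyOf(doc))).digest();
    loads[digest.readUInt32BE(0) % shardCount] += 1;
  }
  return loads;
}

// Hypothetical workload: 20 tenants, where tenant-0 and tenant-1 (think
// "PROD") produce 40% of the records each; every 5 records share a SessionId.
function makeDocs(count) {
  const docs = [];
  for (let i = 0; i < count; i++) {
    const tenant = i % 10 < 8 ? `tenant-${i % 2}` : `tenant-${2 + (i % 18)}`;
    docs.push({ Tenant: tenant, SessionId: `session-${Math.floor(i / 5)}` });
  }
  return docs;
}

const docs = makeDocs(100000);
const byTenant = shardLoads(docs, (d) => d.Tenant, 4);
const bySession = shardLoads(docs, (d) => d.SessionId, 4);
console.log("by Tenant:   ", byTenant);
console.log("by SessionId:", bySession);
```

In this model, whichever shard holds a busy tenant carries at least that tenant's full 40% share no matter how many shards are added, while the SessionId counts stay close to an even split.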

Related Solutions

Mongodb – Choosing shard key and friendly URL Ids for the MongoDB

Before you go further you need to answer a few questions

How do you represent files within folders in the database?
How do you represent folders?
Do you have relations between folders (parent -> child)?
How often do you expect to create folders and files?
How often do you update existing files in folders, and how many files do you update?

Based on your answers you can choose a write-optimized schema or a read-optimized schema. A write-optimized schema contains many very small entries, or lets you use built-in operators like $inc over a collection. A read-optimized schema is generally a larger collection like the one you described. In your scenario you could very easily have something like this (assuming all folders are at the same level):

{ "userid" : "email or id",
  "folders" : [
     { "folder1" : [ "file1", "file2"] },
     { "folder2" : [ "file3", "file4"] }
  ]
}

But with this schema it gets quite complicated if you need to link a folder to a parent folder. It is obvious, though, that userid is the shard key.
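If parent -> child links do matter, one possible write-optimized alternative is a small document per folder or file, each holding a reference to its parent. The sketch below is purely illustrative: the field names (type, name, parent) are invented, and the filter is an in-memory stand-in for a query on userid and parent. Every entry still carries userid, so it can remain the shard key.

```javascript
// Hypothetical write-optimized shape: one small document per folder or file,
// each pointing at its parent. Field names here are illustrative only.
const entries = [
  { _id: 1, userid: "u1", type: "folder", name: "folder1", parent: null },
  { _id: 2, userid: "u1", type: "folder", name: "folder2", parent: 1 },
  { _id: 3, userid: "u1", type: "file", name: "file1", parent: 1 },
  { _id: 4, userid: "u1", type: "file", name: "file3", parent: 2 },
];

// In-memory stand-in for a query such as find({ userid, parent }).
function childrenOf(allEntries, userid, parentId) {
  return allEntries
    .filter((e) => e.userid === userid && e.parent === parentId)
    .map((e) => e.name);
}

console.log(childrenOf(entries, "u1", 1)); // [ 'folder2', 'file1' ]
```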

MongoDB presplitting chunks for compound shard key

There is just a slight issue with how you are passing the $minKey values in. Try this instead:

db.adminCommand( { split : "mydb.mycollection" , middle : { "region" : "region1", "foo" : MinKey , "bar" : MinKey } } );
db.adminCommand( { split : "mydb.mycollection" , middle : { "region" : "region2", "foo" : MinKey , "bar" : MinKey } } );

This got me the following layout:

sh.status()
--- Sharding Status --- 
  sharding version: {
    "_id" : 1,
    "version" : 4,
    "minCompatibleVersion" : 4,
    "currentVersion" : 5,
    "clusterId" : ObjectId("53a2cd9d98b4ace818666544")
}
  shards:
    {  "_id" : "shard0000",  "host" : "localhost:30000" }
    {  "_id" : "shard0001",  "host" : "localhost:30001" }
    {  "_id" : "shard0002",  "host" : "localhost:30002" }
  databases:
    {  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
    {  "_id" : "mydb",  "partitioned" : true,  "primary" : "shard0001" }
        mydb.mycollection
            shard key: { "region" : 1, "foo" : 1, "bar" : 1 }
            chunks:
                shard0000   1
                shard0001   2
            {
    "region" : { "$minKey" : 1 },
    "foo" : { "$minKey" : 1 },
    "bar" : { "$minKey" : 1 }
} -->> {
    "region" : "region1",
    "foo" : { "$minKey" : 1 },
    "bar" : { "$minKey" : 1 }
} on : shard0000 Timestamp(2, 0) 
            {
    "region" : "region1",
    "foo" : { "$minKey" : 1 },
    "bar" : { "$minKey" : 1 }
} -->> {
    "region" : "region2",
    "foo" : { "$minKey" : 1 },
    "bar" : { "$minKey" : 1 }
} on : shard0001 Timestamp(2, 2) 
            {
    "region" : "region2",
    "foo" : { "$minKey" : 1 },
    "bar" : { "$minKey" : 1 }
} -->> {
    "region" : { "$maxKey" : 1 },
    "foo" : { "$maxKey" : 1 },
    "bar" : { "$maxKey" : 1 }
} on : shard0001 Timestamp(2, 3)

The use of the $minKey (MinKey) and $maxKey (MaxKey) values is a bit tough to tease out (they are rarely used except internally), but there is a decent, illustrative example in the MongoDB docs.
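With more than a couple of regions, the two split commands above could be generated instead of typed by hand. This is a minimal sketch: buildSplitCommands is a hypothetical helper, and the MinKey sentinel is passed in as an argument so the function has no shell dependency.

```javascript
// Sketch: build one split command per region boundary for the compound shard
// key { region, foo, bar }. buildSplitCommands is a hypothetical helper;
// `minKey` is whatever value the shell uses for MinKey.
function buildSplitCommands(namespace, regions, minKey) {
  return regions.map((region) => ({
    split: namespace,
    middle: { region: region, foo: minKey, bar: minKey },
  }));
}

// In the mongo shell, connected to a mongos:
// buildSplitCommands("mydb.mycollection", ["region1", "region2"], MinKey)
//   .forEach((cmd) => printjson(db.adminCommand(cmd)));
```

Passing MinKey here matches the commands above; newer shells may need it written as MinKey().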
