MongoDB as log storage: choosing a shard key

logging mongodb multi-tenant sharding

I'm designing a log storage system based on MongoDB.
I want to shard the log collection to increase ingestion throughput and capacity (by distributing writes across several machines) while still allowing fast searches.
I should be able to increase ingestion by adding more nodes to the cluster.

My collection has the following fields:

Subsystem – string, name of the application. E.g: "SystemA", "SystemB". ~ 100 unique values.

Tenant – string, the name of the deployment. It's used to separate logs from different application deployments / environments. E.g: "South TEST", "North DEV", "South PROD", "North PROD". ~ 20 unique values.

Date – timestamp.

User – string.

SessionId – guid, logically groups several related log records.

Data – BLOB, contains zipped data. Average size = 2 KB, maximum = 8 MB.

Context – array of key/value pairs. Both key and value are strings. It's used to store additional metadata associated with the event.
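
To make the schema concrete, here is a minimal sketch of one record as a Python dict. The helper name `make_log_record`, the `k`/`v` key names inside Context, and the use of zlib for the zipped payload are assumptions for illustration; the field list above does not specify them.

```python
import uuid
import zlib
from datetime import datetime, timezone


def make_log_record(subsystem, tenant, user, session_id, payload, context):
    """Build one log record shaped like the field list above (hypothetical helper)."""
    return {
        "Subsystem": subsystem,
        "Tenant": tenant,
        "Date": datetime.now(timezone.utc),
        "User": user,
        "SessionId": session_id,
        # Assumption: zlib stands in for whatever compression produces the BLOB.
        "Data": zlib.compress(payload),
        # Assumption: each key/value pair becomes a {"k": ..., "v": ...} subdocument,
        # so all pairs share one predictable shape regardless of the key name.
        "Context": [{"k": k, "v": v} for k, v in context.items()],
    }


record = make_log_record(
    "SystemA", "South PROD", "alice", uuid.uuid4(),
    b"raw event payload", {"orderId": "42"},
)
```

Storing Context as a list of uniform subdocuments, rather than as a dict keyed by arbitrary names, is one common way to keep arbitrary metadata searchable; whether it fits this system is a design choice.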

A search can use any combination of the fields Subsystem, Date, User and Context.
Tenant will almost always be specified.
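
The search pattern described above can be sketched as a function that builds a MongoDB-style filter from whichever criteria are supplied. The function name `build_search_filter`, its parameters, and the `k`/`v` key names for Context entries are assumptions; only the operators `$gte`, `$lt`, `$all` and `$elemMatch` are standard MongoDB query operators.

```python
from datetime import datetime


def build_search_filter(tenant, subsystem=None, user=None,
                        date_from=None, date_to=None, context=None):
    """Return a MongoDB-style filter dict for the given optional criteria."""
    query = {"Tenant": tenant}
    if subsystem is not None:
        query["Subsystem"] = subsystem
    if user is not None:
        query["User"] = user
    date_range = {}
    if date_from is not None:
        date_range["$gte"] = date_from
    if date_to is not None:
        date_range["$lt"] = date_to
    if date_range:
        query["Date"] = date_range
    if context:
        # Each requested pair must match key and value within one array element.
        query["Context"] = {
            "$all": [{"$elemMatch": {"k": k, "v": v}} for k, v in context.items()]
        }
    return query


example = build_search_filter(
    "South PROD",
    user="alice",
    date_from=datetime(2024, 1, 1),
    context={"orderId": "42"},
)
```

Note that none of these filters contains SessionId, so whether a query can be routed to a single shard depends entirely on which of the searchable fields ends up in the shard key.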

The question is – which shard key and sharding strategy would be best in this case?

My suggestions:

The simplest option is to shard by Tenant, but that will cause highly uneven data distribution, because PROD environments generate many more logs than DEV.

"Tenant + Subsystem" seems better, but some subsystems still generate many more logs than others.
Also, Subsystem is not mandatory: a user can omit it from a search, and the query will then be broadcast to all shards.

"SessionId" will give an even data distribution, but search requests will be broadcast to all nodes.

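The distribution trade-off between these options can be illustrated with a small simulation. Everything here is an assumption made for the sketch: the tenant weights, the shard count, roughly five records per session, and the use of MD5 as a stand-in for MongoDB's own hashing and chunk placement. It only shows the general effect, not real cluster behaviour.

```python
import hashlib
import random
import uuid
from collections import Counter

NUM_SHARDS = 4
# Hypothetical relative log volumes; PROD tenants dominate, as in the question.
TENANT_WEIGHTS = {"South PROD": 45, "North PROD": 45, "South TEST": 7, "North DEV": 3}


def shard_for(value):
    """Map a key value to a shard with a stable hash (not MongoDB's real function)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def busiest_shard_share(keys):
    """Fraction of all records that land on the most loaded shard."""
    counts = Counter(shard_for(key) for key in keys)
    return max(counts.values()) / len(keys)


def simulate(num_records=20000, seed=1):
    rng = random.Random(seed)
    tenants = rng.choices(
        list(TENANT_WEIGHTS), weights=list(TENANT_WEIGHTS.values()), k=num_records
    )
    # A pool of session ids, each shared by about five records on average.
    pool = [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(num_records // 5)]
    sessions = [rng.choice(pool) for _ in range(num_records)]
    return {
        "Tenant": busiest_shard_share(tenants),
        "SessionId": busiest_shard_share(sessions),
    }


shares = simulate()
```

With a Tenant key, the busiest shard can never hold less than the largest tenant's share of the traffic, because all records with the same key value live together. With SessionId, the many distinct values spread close to evenly, at the cost that a search without SessionId cannot be targeted at one shard.
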
Best Answer

For even write distribution, SessionId looks like a much better idea (or a hashed index on other fields). Proper indexing should then take care of the queries.

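The other two options have very low cardinality (about 20 distinct values for Tenant, and at most about 2,000 for Tenant + Subsystem), and therefore very large chunks, since a chunk cannot be split below a single shard key value.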
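
As a rough sketch of what this could look like in practice, the snippet below builds the command document for sharding on a hashed SessionId plus a few candidate secondary indexes. The namespace `logs.events`, the helper name, the particular index combinations, and the `Context.k`/`Context.v` field names are all assumptions for illustration; the right indexes depend on the actual query mix. No server is contacted here. With a driver such as PyMongo, the command could be passed to `client.admin.command(...)` and each key list to `collection.create_index(...)`.

```python
def shard_on_hashed_field(namespace, field):
    """Command document for the shardCollection admin command with a hashed key."""
    return {"shardCollection": namespace, "key": {field: "hashed"}}


# Assumption: "logs.events" is a placeholder "<database>.<collection>" namespace.
command = shard_on_hashed_field("logs.events", "SessionId")

# Assumption: candidate secondary indexes for the searches in the question,
# written as (field, direction) pairs; 1 is ascending, -1 is descending.
# Tenant leads each index because it is almost always part of the search.
candidate_indexes = [
    [("Tenant", 1), ("Date", -1)],
    [("Tenant", 1), ("Subsystem", 1), ("Date", -1)],
    [("Tenant", 1), ("User", 1), ("Date", -1)],
    [("Tenant", 1), ("Context.k", 1), ("Context.v", 1)],
]
```

Each shard maintains these indexes over its own portion of the data, so a broadcast query still resolves with an index lookup on every shard rather than a collection scan.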