MongoDB 3.0 Performance – Indexes Maximum Limitation

Tags: index, mongodb, mongodb-3.0, performance

I have a MongoDB database that stores users' public messages.
Each message must be categorized by at least one tag (three at most, stored in an array).

A tag can be any string of up to 32 characters from [0-9a-z].

The tag field is not part of the message field, and searches will only be run against the tag field (i.e. no full-text index is needed).

Since there are almost infinite possibilities for a tag, my thought is that at some point the MongoDB index will be huge, with some tags referencing millions of documents while others reference only a few.

What is the best practice to keep performance high? Are my worries unfounded?

P.S.:
– The server's RAM is 64GB.
– The message's date is also indexed (for sorting)

Best Answer

On your indices

OK, first things first. Assuming a structure like this:

{
  _id: new ObjectId(),
  date: new ISODate(),
  message: "Hello, Multikey Indices!",
  tags: ["MongoDB","Indices","Multikey"]
}

indexing tags would result in a multikey index. For the document above, the index would have three entries, "MongoDB", "Indices" and "Multikey", all pointing to the same document. Those entries would take up around 200 bytes. Now let's assume the worst-case scenario, in which the tags of each document are distinct from the tags of every other document, resulting in three new index entries per document. Even if we multiply those 200 bytes by 100,000,000 documents, the result is just 20 GB.
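The arithmetic above can be written out as a small script. This is only a back-of-the-envelope sketch: the 200 bytes per document and the 100,000,000 documents are the rough figures assumed in this answer, the 64 GB comes from the question, and the function name `estimateIndexSizeGB` is made up for illustration. It ignores everything else that competes for RAM, such as the working set of documents.

```javascript
// Worst-case index size estimate, using the rough figures assumed in the answer.
const BYTES_PER_DOCUMENT = 200;    // assumed size of the three tag entries of one document
const DOCUMENT_COUNT = 100000000;  // 100 million documents, each with unique tags
const SERVER_RAM_GB = 64;          // RAM of the server, as stated in the question

// Converts the total to decimal gigabytes (1 GB = 10^9 bytes).
function estimateIndexSizeGB(bytesPerDocument, documentCount) {
  return (bytesPerDocument * documentCount) / 1e9;
}

const indexSizeGB = estimateIndexSizeGB(BYTES_PER_DOCUMENT, DOCUMENT_COUNT);
console.log("estimated index size: " + indexSizeGB + " GB");
console.log("smaller than server RAM: " + (indexSizeGB < SERVER_RAM_GB));
```

Plugging in your own measured entry size and expected document count gives a first idea of when the index would outgrow the available RAM.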

And here comes the good stuff: MongoDB keeps indices in RAM as long as they fit. And for memory access, we are talking about somewhere around 75 ns (mind you, a nanosecond is a billionth of a second). So even in our worst-case scenario, our index will be searched pretty fast.

As for the sorting: as a rule of thumb, one should assume that MongoDB can use only one index per query. While this is not entirely true, since index intersection has been available as of MongoDB 2.6, it can be a rather delicate topic. Since a compound index can also serve queries on its leading field alone, it is pretty reasonable to create a compound index here. But order matters: you query for tags and then sort by date, so your index should be created like this

db.collection.createIndex({tags:1,date:1})

if you sort in ascending order in the query, or like this

db.collection.createIndex({tags:1,date:-1})

for a descending sort order.
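To see why the field order matters, here is a toy model in plain JavaScript of an index on {tags:1,date:-1}. It is a deliberately simplified sketch and not how MongoDB stores indexes internally; the sample documents and the helper names `buildIndex` and `findByTag` are invented for illustration.

```javascript
// Toy model of a compound multikey index on {tags: 1, date: -1}.
// NOT MongoDB's real implementation; it only illustrates the key order.
const messages = [
  { _id: 1, date: new Date("2015-03-01"), tags: ["mongodb", "indices"] },
  { _id: 2, date: new Date("2015-03-03"), tags: ["mongodb"] },
  { _id: 3, date: new Date("2015-03-02"), tags: ["indices", "multikey"] },
  { _id: 4, date: new Date("2015-03-04"), tags: ["mongodb", "multikey"] },
];

// One entry per (tag, document) pair: this is what makes the index "multikey".
function buildIndex(docs) {
  const entries = [];
  for (const doc of docs) {
    for (const tag of doc.tags) {
      entries.push({ tag: tag, date: doc.date, id: doc._id });
    }
  }
  // Order by tag ascending, then by date descending within the same tag.
  entries.sort((a, b) => {
    if (a.tag !== b.tag) return a.tag < b.tag ? -1 : 1;
    return b.date - a.date;
  });
  return entries;
}

// Entries for one tag sit next to each other and are already in date order,
// so the result needs no extra sort. A real index would seek directly to the
// first entry for the tag; the linear filter here just keeps the sketch short.
function findByTag(index, tag) {
  return index.filter((entry) => entry.tag === tag).map((entry) => entry.id);
}

const index = buildIndex(messages);
console.log(findByTag(index, "mongodb")); // newest message first
```

In the shell, the matching query would be along the lines of db.collection.find({tags:"MongoDB"}).sort({date:-1}).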

That being said, we can now have a look at how to ensure performance for a MongoDB cluster.

On the real problem (IMHO)

IMHO, your mistake is to assume that ensuring performance is a technical question. I strongly disagree with that. There are a few things which can be done from the technical side (such as proper indexing), but at the very bottom, ensuring the performance of a MongoDB cluster is an administrative task. Let me explain this in a bit more detail.

From my experience, you need to size your hardware to your needs. If you need certain queries to be fast, that's easy to achieve. But if your hardware lacks the necessary resources, your options are either to scale up/out or to accept a performance impact. (In your scenario, your hardware seems to have sufficient RAM, btw.)

Here is what I do when I expect that I need to scale.

  1. For my initial hardware, I make sure that I get the most bang for the buck RAM- and IOPS-wise. This way, I am as cost efficient as possible, and scaling up does not make sense any more.
  2. Since I know that my only cost-efficient option is to scale out, I make a written plan for it (what to do, who has to make the "go" decision, how to migrate, when to schedule a maintenance window, etc.) and discuss it with the stakeholders. If service interruption is not an option, I start with a sharded cluster with a single shard. That's often viable, since config servers can easily be run on extremely cheap VMs, which you can often run for years (or even, admittedly in theory, decades) before this approach becomes more expensive than a service interruption. I strongly suggest the latter approach, since scaling out becomes as easy as adding a shard to the cluster, and you don't need to deal with the intricacies of sharding when you are already in sort of a pinch (data migration included for free, too). But make sure you have a written plan for this, too, agreed on by all stakeholders.
  3. I set alerts for disk and RAM usage. As a rule of thumb, an alert when the usage of either one reaches 85% should give you plenty of time to execute your scaling plan.
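The 85% rule of thumb from the last step can be sketched as a simple check. This is only an illustration under assumptions: the function name `resourcesToAlertOn`, the shape of the `usage` object and the sample figures are all made up, and in practice the numbers would come from whatever monitoring system you use.

```javascript
// Threshold from the rule of thumb above: alert once usage reaches 85%.
const ALERT_THRESHOLD = 0.85;

// Returns the names of all resources whose usage has reached the threshold.
// `usage` maps a resource name to { used, total }, both in the same unit
// (a hypothetical shape chosen for this sketch).
function resourcesToAlertOn(usage, threshold = ALERT_THRESHOLD) {
  return Object.keys(usage).filter(
    (name) => usage[name].used / usage[name].total >= threshold
  );
}

// Made-up sample readings, in GB.
const sample = {
  ram: { used: 56, total: 64 },     // 87.5% of the 64 GB from the question
  disk: { used: 400, total: 1000 }, // 40%
};
console.log(resourcesToAlertOn(sample));
```

With these sample figures only RAM crosses the threshold, which would be the signal to start executing the written scaling plan.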

From my experience, this is the best practice to keep the performance of MongoDB high.