I have a MongoDB database that stores users' public messages.
Each message should be categorized by at least one tag (3 max, in an array).
A tag can be anything up to 32 characters ([0-9a-z]).
The tag field is not part of the message field, and searches will only occur on the tag field (i.e. no full-text index).
Since there are almost infinite possibilities for the tag, my thought is that at some point the MongoDB index will be huge, with some tags referencing millions of documents while others reference only a few.
What is the best practice to keep performance high? Are my worries unfounded?
P.S.:
– The server's RAM is 64GB.
– The message's date is also indexed (for sorting)
Best Answer
On your indices
Ok, first things first. Assume each message document carries its tags in a tags array, for example one document tagged "MongoDB", "Indices" and "Multikey". Indexing tags would result in a multikey index. For that document, the index would have three entries: "MongoDB", "Indices" and "Multikey", all pointing to the same document. The index would have a size of around 200 bytes. Now let's assume the worst case scenario: each document has tags distinct from the tags of every other document, resulting in three new index entries per document. Even if we multiply those 200 bytes by 100,000,000, the result is just 20 GB.
And here comes the good stuff: MongoDB keeps indices in RAM as long as they fit. And for memory access, we are talking of somewhere around 75 ns (mind you, a nanosecond is a billionth of a second). So even in our worst case scenario, our index will be searched pretty fast.
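To make the multikey idea and the worst-case arithmetic concrete, here is a rough sketch in plain JavaScript. It does not talk to MongoDB; it only mimics how one index entry per array element comes about. The document shape (every field except tags) and the helper names multikeyEntries and worstCaseIndexBytes are made up for illustration, and the 200 bytes per document is the estimate from above, not a measured value.

```javascript
// Hypothetical message document; every field name except "tags" is an assumption.
const message = {
  _id: 1,
  text: "Some public message",
  tags: ["MongoDB", "Indices", "Multikey"],
  date: new Date("2015-01-01T00:00:00Z"),
};

// A multikey index holds one entry per array element,
// each entry pointing back to the same document.
function multikeyEntries(doc, field) {
  return doc[field].map((value) => ({ key: value, docId: doc._id }));
}

// Worst case from the text: every document contributes only new tags,
// at roughly 200 bytes of index per document.
function worstCaseIndexBytes(documentCount, bytesPerDocument = 200) {
  return documentCount * bytesPerDocument;
}

const entries = multikeyEntries(message, "tags");
const indexBytes = worstCaseIndexBytes(1e8); // 100,000,000 documents
const ramBytes = 64e9; // the 64 GB server from the question, in decimal GB

console.log(entries.length);        // 3
console.log(indexBytes / 1e9);      // 20
console.log(indexBytes < ramBytes); // true
```

So even under the pessimistic assumption, the estimated index stays well below the 64 GB mentioned in the question.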
As for the sorting: As a rule of thumb, one should assume that MongoDB can use only one index per query. While this is not entirely true, as there are index intersections as of MongoDB 2.6, it can be a rather delicate topic. Since a compound index can also serve queries on its leading fields alone, it is pretty reasonable to create a compound index here. But: order matters. You query for tags and then sort by date, so the index should list the tags field first and the date field second, with the date ascending in case you sort in ascending order when doing the query, or descending for a descending sort order.
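As a sketch of the two index definitions described above: the field names tags and date and the collection name messages are assumptions, since the question does not give the actual schema. The helper createTagDateIndex is hypothetical as well; it only wraps the createIndex call that a collection object offers in the mongo shell.

```javascript
// Key order matters: the field you match on ("tags") comes first,
// the field you sort on ("date") comes second.
const ascendingSpec = { tags: 1, date: 1 };   // for sort({ date: 1 })
const descendingSpec = { tags: 1, date: -1 }; // for sort({ date: -1 })

// Creates the index that matches the sort direction of your queries.
// "collection" is anything exposing createIndex(keys), e.g. db.messages
// in the mongo shell (the collection name is an assumption).
function createTagDateIndex(collection, newestFirst) {
  return collection.createIndex(newestFirst ? descendingSpec : ascendingSpec);
}

// In the shell, this would be used roughly like so:
//   createTagDateIndex(db.messages, true)
//   db.messages.find({ tags: "mongodb" }).sort({ date: -1 })
```

Wrapping the call in a function is only there to keep the two key specifications side by side; calling createIndex directly with one of them does the same thing.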
That being said, we can now have a look at how to ensure performance for a MongoDB cluster.
On the real problem (IMHO)
IMHO, your mistake is to assume that ensuring performance is a technical question. I strongly disagree with that. There are a few things which can be done from the technical side (such as proper indexing), but at the very bottom, ensuring the performance of a MongoDB cluster is an administrative task. Let me explain this in a bit more detail.
From my experience, you need to adapt the dimensions of your hardware to your needs. If you need certain queries to be fast, that's easy to achieve. But if your hardware lacks the necessary resources, your options are either to scale up/out or to accept a performance impact. (In your scenario, your hardware seems to have sufficient RAM, btw.)
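One way to keep an eye on this is to compare the reported index size with the RAM you have. The following is only a sketch: it assumes a stats object shaped like the output of db.collection.stats(), which reports totalIndexSize in bytes. The helper indexRamUsage and the 50% threshold are my own arbitrary choices, and the sample number is the worst-case estimate from above, not a measurement.

```javascript
// Reports which share of RAM the indices would occupy.
// "stats" is assumed to look like db.collection.stats() output (bytes).
// "maxShare" is an arbitrary threshold, leaving room for data and the OS.
function indexRamUsage(stats, ramBytes, maxShare = 0.5) {
  const share = stats.totalIndexSize / ramBytes;
  return { share, fits: share <= maxShare };
}

// Worst-case estimate from the answer (20 GB) against the 64 GB server.
const sampleStats = { totalIndexSize: 20e9 };
const usage = indexRamUsage(sampleStats, 64e9);

console.log(usage.share); // 0.3125
console.log(usage.fits);  // true
```

In the shell you would pass db.messages.stats() (collection name assumed) instead of the sample object.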
Here is what I do when I expect that I need to scale.
From my experience, this is the best practice to keep the performance of MongoDB high.