MongoDB scaling when working with files

mongodb

I have read through the MongoDB manual about sharding and replicasets.

However, I would like to know if the following can be achieved with sufficient performance (read/write):

Saving 10,000,000 files via GridFS on MongoDB instances
Total file size of about 2TB without indexes and journals
10,000 writes / day
10,000 reads / day
Querying for a document does not need respond instantly, 2 seconds is still acceptable

Setup in mind:

1 Replicaset of 3 nodes
Each node with 32GB Ram, 2TB SSD
1 Mongos instance
1 Config server

In the case the proposed setup would not suffice to obtain the goal of storing 10,000,000 documents and writing/reading 10,000 a day with fairly good performance, what would be the next step to take?

Adding more RAM to each node
Adding more disk capacity to each node
Adding Another replica set with the same configuration
Sharding

I would like to avoid sharding as much as possible since I consider it best practice to avoid sharding whenever possible.
I feel this would only complicate the topology unnecessarily and this would most likely be overkill in this case

Some advice would be much welcome.

Cheers

Best Answer

My answer is : It depends. If you are accessing files by _id field, which is already indexed then you don't need to add more memory soon.

The _id field which is type of ObjectID is 12 byte in size. That means it can hold up to 2^(12*8) files. 3 byte is for machine ID which is a hash value and has a fixed vale on the machine can be subtracted which gives you approx 2^72 files. For the reference, 2^20 is 1,048,576.

In terms of the memory, the index on the _id field needs 10,000,000 x 12 byte = 114 MiBytes. To be honest, I don't now how much overhead there will be for an index which holds 10 millions value but I don't think that it will need more than 1 Gigabyte.

Now, if your _id field is not a type of ObjectID than do the math.

In the gridfs, filename value of the files collection is also indexed. If you are not accessing files using filename, then you may leave it blank and drop the index for the filename.

On the other side, if you will add some metadata to the files you added and want to query the files according to those metadata, then you should have indexes for those metadata and do the math again.

I have a production environment which has over 3,000,000 pdf files (takes 180 Gig space on the disk). My server is a virtual server which has 4 vCPU and 4 Gig RAM, still no problem. The specs you provided is way too high for your needs. You can save billions of files with those servers. Especially if you have SSD. Because even if your indexes do not fit into the memory, swapping will be very fast, you won't even notice a slowdown.

The Background Case

The general take on the practice recommendation for co-locating the mongos process along with the application instance is to obviate any network overhead required in order for the application to communicate with that mongos process. Of course it is also "recommended practice" to specify a number of mongos instances in the application connection string in the case where that "nearest" node should not be available for some reason then another could be selected, albeit with the possible overhead of contacting a remote node.

The "docker" case you mentions seems somewhat arbitrary. While it is true that one of the primary goals of containers ( and before that, something like BSD jails or even chroot ) is generally to achieve some level of "process isolation", there is nothing really wrong with running multiple processes as long as you understand the implications.

In this particular case the mongos is meant to be "lightweight" and run as an "additional function" to the application process in a way that it is pretty much a "paired" part of the application itself. So docker images themselves don't have an "initd" like process but there is not really anything wrong with with running a process controller like supervisord ( for example ) as the main process for the container which then gives you a point of process control over that container as well. This situation of "paired processes" is a reasonable case and also a common enough ask that there is official documentation for it.

If you chose that kind of "paired" operation for deployment, then it does indeed address the primary point of maintaining a mongos instance on the same network connection and indeed "server instance" as the application server itself. It can also be viewed in some way as a case where the "whole container" were to fail then that node in itself would simply be invalid. Not that I would recommend it, and in fact you probably should still configure connections to look for other mongos instances even if these are only accessible over a network connection that increases latency.

Version Specific / Usage Specific

Now that that point is made, the other consideration here comes back to that initial consideration of co-locating the mongos process with the application for network latency purposes. In versions of MongoDB prior to 2.6 and specifically with regard to operations such as with the aggregation framework, then the case there was that there would be a lot more network traffic and subsequent after processing work performed by the mongos process for dealing with data from different shards. That is not so much the case now as a good deal of the processing workload can now be performed on those shards themselves before "distilling" to the "router".

The other case is your application usage patterns itself with regard to the sharding. That means whether the primary workload is in "distributing the writes" across multiple shards, or indeed being a "scatter-gather" approach in consolidating read requests. In those scenarios

Test, Test and then Test Again

So the final point here is really self explanatory, and comes down to the basic consensus of any sane response to your question. This is not a new thing for MongoDB or any other storage solution, but your actual deployment environment needs to be tested on it's "usage patterns" as close to actual reality just as much as any "unit testing" of expected functionality from core components or overall results needs to be tested.

There really is not "definitive" statement to say "configure this way" or "use in this way" that actually makes sense apart from testing what "actually works best" for your application performance and reliability as is expected.

Of course the "best case" will always be to not "crowd" the mongos instances with requests from "many" application server sources. But then to allow them some natural "parity" that can be distributed by the resource workloads available to having at "least" a "pool of resources" that can be selected, and indeed ideally in many cases but obviating the need to induce an additional "network transport overhead".

That is the goal, but ideally you can "lab test" the different perceived configurations in order to come to a "best fit" solution for your eventual deployment solution.

I would also strongly recommend the "free" ( as in beer ) courses available as already mentioned, and no matter what your level of knowledge. I find that various course material sources often offers "hidden gems" to give more insight into things that you may not have considered or otherwise overlooked. The M102 Class as mentioned is constructed and conducted by Adam Commerford for whom I can attest has a high level of knowledge on large scale deployments of MongoDB and other data architectures. Worth the time to at least consider a fresh perspective on what you may think you already know.

Best Answer

Related Solutions

Mongodb – Improving concurrency in MongoDB via sharding

MongoDB: co-locate the mongos process on application servers

The Background Case

Version Specific / Usage Specific

Test, Test and then Test Again

Related Question