MongoDB scaling when working with files

mongodb

I have read through the MongoDB manual's sections on sharding and replica sets.

However, I would like to know if the following can be achieved with sufficient performance (read/write):

  • Saving 10,000,000 files via GridFS on MongoDB instances (see the sketch after this list)
  • Total file size of about 2TB without indexes and journals
  • 10,000 writes / day
  • 10,000 reads / day
  • Querying for a document does not need to respond instantly; 2 seconds is still acceptable
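
To make the workload concrete, here is a minimal sketch of that write/read pattern using pymongo's gridfs module; the connection string, database name, and file path are placeholders, not part of the actual setup:

```python
from pymongo import MongoClient
import gridfs

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
fs = gridfs.GridFS(client["mydb"])                 # placeholder db name

# Write: one of the ~10,000 uploads per day.
with open("report.pdf", "rb") as f:                # placeholder file
    file_id = fs.put(f, filename="report.pdf")

# Read: one of the ~10,000 reads per day, fetched back by _id.
data = fs.get(file_id).read()
```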

Setup in mind:

  • 1 replica set of 3 nodes
  • Each node with 32 GB RAM and a 2 TB SSD
  • 1 Mongos instance
  • 1 Config server

If the proposed setup would not suffice to reach the goal of storing 10,000,000 documents and reading/writing 10,000 of them per day with fairly good performance, what would be the next step?

  • Adding more RAM to each node
  • Adding more disk capacity to each node
  • Adding another replica set with the same configuration
  • Sharding

I would like to avoid sharding as much as possible; I feel it would only complicate the topology unnecessarily and would most likely be overkill in this case.

Any advice would be very welcome.

Cheers

Best Answer

My answer is: it depends. If you are accessing files by the _id field, which is already indexed, then you won't need to add more memory any time soon.
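
As a quick sanity check, you can list the indexes GridFS maintains on the files collection; the database name below is a placeholder, and the output assumes the default "fs" collection prefix:

```python
from pymongo import MongoClient

db = MongoClient()["mydb"]  # placeholder db name
print(db["fs.files"].index_information())
# Typically shows the implicit _id index plus the default
# (filename, uploadDate) index that drivers create for GridFS.
```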

The _id field, of type ObjectId, is 12 bytes (96 bits) in size, so it can hold up to 2^(12*8) = 2^96 distinct values. The 3 bytes used for the machine ID, a hash that has a fixed value on a given machine, can be subtracted, which still leaves approximately 2^72 possible files. For reference, 2^20 is 1,048,576.
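
A quick way to verify these sizes (a minimal sketch; nothing here is specific to any deployment):

```python
from bson import ObjectId

oid = ObjectId()
print(len(oid.binary))  # 12 bytes = 96 bits per ObjectId
print(2 ** (12 * 8))    # 2^96 total bit patterns
print(2 ** 72)          # ~4.7e21 after subtracting the 24-bit machine ID
```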

In terms of memory, the keys of the index on the _id field need 10,000,000 x 12 bytes = 120,000,000 bytes, or about 114 MiB. To be honest, I don't know how much overhead there will be for an index holding 10 million values, but I don't think it will need more than 1 GiB.
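
The back-of-the-envelope arithmetic, spelled out:

```python
files = 10_000_000
key_bytes = files * 12      # 12 bytes per ObjectId key
print(key_bytes)            # 120,000,000 bytes
print(key_bytes / 2 ** 20)  # ~114.4 MiB
```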

Now, if your _id field is not of type ObjectId, then redo the math accordingly.

In GridFS, the filename field of the files collection is also indexed by default. If you are not accessing files by filename, you can leave it blank and drop that index.
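
Dropping that index could look like the sketch below. The index name assumes the default "fs" prefix and the auto-generated name for the (filename, uploadDate) key, so check index_information() first, since your deployment may differ:

```python
from pymongo import MongoClient

db = MongoClient()["mydb"]  # placeholder db name
# Auto-generated name for the default (filename, uploadDate) index;
# verify it with db["fs.files"].index_information() before dropping.
db["fs.files"].drop_index("filename_1_uploadDate_1")
```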

On the other hand, if you add metadata to your files and want to query by that metadata, then you should create indexes on those fields and redo the math.
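
For example, attaching a metadata field and indexing it might look like this; "category" is a made-up field for illustration, and the database name is a placeholder:

```python
from pymongo import MongoClient
import gridfs

db = MongoClient()["mydb"]  # placeholder db name
fs = gridfs.GridFS(db)

# Attach metadata at upload time.
file_id = fs.put(b"%PDF-...", filename="invoice-001.pdf",
                 metadata={"category": "invoice"})

# Index the field queries will filter on (this is the index whose
# size you would add to the memory math above).
db["fs.files"].create_index("metadata.category")

# Query through GridFS; find() filters on the fs.files collection.
for grid_out in fs.find({"metadata.category": "invoice"}):
    print(grid_out.filename)
```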

I have a production environment with over 3,000,000 PDF files (taking 180 GB of disk space). The server is a virtual machine with 4 vCPUs and 4 GB of RAM, and there are still no problems. The specs you listed are far higher than your needs; you could store billions of files with those servers, especially since you have SSDs: even if your indexes do not fit into memory, swapping will be so fast that you won't even notice a slowdown.