MongoDB insertion in two collections or one to improve insertion speed

bulk-insert mongodb mongodb-3.2

I have a very large collection (~300 million records).
The insertion rate is also very high, and inserting documents takes a long time, which is my main problem.

I have two types of reports; let's call them report_type_1 and report_type_2. Currently I insert both types into the same collection.
Would insertion be any faster if I used a separate collection for each report type?

Simply put, which is faster: inserting 1000 records per minute into a single collection, or inserting 500 records per minute into each of two collections?

Best Answer

I have two types of reports; let's call them report_type_1 and report_type_2. Currently I insert both types into the same collection. Would insertion be any faster if I used a separate collection for each report type?

The outcome will depend on the storage engine you are using and whether concurrency control is the limiting factor for your insertion rate. The MMAPv1 storage engine has collection-level locking, while the WiredTiger storage engine (default in MongoDB 3.2+) has document-level concurrency control.
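
To see which case applies, you can ask the server which storage engine it is running. The following is a minimal sketch, assuming a pymongo `MongoClient` is passed in; the helper names `storage_engine_name` and `locking_granularity` are made up for illustration, and the granularity table only restates the two engines discussed above.

```python
# Locking granularity for the two engines discussed in this answer.
GRANULARITY = {
    "mmapv1": "collection-level locking",
    "wiredTiger": "document-level concurrency control",
}


def storage_engine_name(client):
    """Return the storage engine name reported by the serverStatus command.

    `client` is expected to behave like a pymongo MongoClient.
    """
    status = client.admin.command("serverStatus")
    return status["storageEngine"]["name"]


def locking_granularity(client):
    """Describe the write concurrency granularity for the server's engine."""
    name = storage_engine_name(client)
    return GRANULARITY.get(name, "unknown (engine not covered here)")


# Example usage (assumes pymongo is installed and a server is reachable):
# from pymongo import MongoClient
# print(locking_granularity(MongoClient("mongodb://localhost:27017")))
```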

If you are pushing the limits of MMAPv1 locking for your deployment, splitting the reports into different collections may improve your insertion rate unless the underlying issue is a resource bottleneck (slow disk, insufficient RAM, ...).

The WiredTiger storage engine has more granular concurrency control as well as data compression (which reduces I/O but adds CPU overhead). With WiredTiger you should be able to increase insertion throughput to a single collection by adding more insertion threads in your application (at least until your server resources are saturated).
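
As a rough illustration of "more insertion threads", the sketch below splits the documents into batches and sends each batch from a thread pool. The function names, the batch size, and the worker count are assumptions chosen for the example, not tuned values. `make_mongo_inserter` assumes a pymongo collection and uses one unordered `insert_many` call per batch.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def parallel_insert(insert_batch, docs, batch_size=1000, workers=4):
    """Call `insert_batch` on each batch from `workers` threads.

    `insert_batch` must return the number of documents it inserted;
    the total across all batches is returned.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(insert_batch, batches(docs, batch_size))
        return sum(counts)


def make_mongo_inserter(collection):
    """Wrap a pymongo collection as an `insert_batch` callable."""
    def insert_batch(batch):
        # ordered=False lets the server continue past individual failures
        # instead of stopping at the first one.
        result = collection.insert_many(batch, ordered=False)
        return len(result.inserted_ids)
    return insert_batch


# Example usage (assumes pymongo, a reachable server, and hypothetical names):
# from pymongo import MongoClient
# reports = MongoClient("mongodb://localhost:27017")["mydb"]["reports"]
# parallel_insert(make_mongo_inserter(reports), docs, workers=8)
```

A single `MongoClient` is safe to share between threads, so one client and one collection object are enough. Raise `workers` gradually while watching server CPU and disk, since throughput stops improving once those are saturated.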

Rather than trying to design around possible MMAPv1 limitations with multiple collections, I would encourage you to test a single collection with the WiredTiger storage engine and multiple insertion threads.

Outside of storage engines, a more general question is whether the different report types should logically be in the same collection. If they have different fields or indexes and you don't query across report types, you may find it more efficient to use separate collections.
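
If you do decide on separate collections, the routing can stay small. This is only a sketch: it assumes each document carries a `report_type` field (a hypothetical name) whose value doubles as the collection name, and that `db` is a pymongo `Database` or anything else that returns a collection when indexed by name.

```python
from collections import defaultdict


def group_by_type(reports, type_field="report_type"):
    """Group report documents by the value of `type_field`."""
    groups = defaultdict(list)
    for report in reports:
        groups[report[type_field]].append(report)
    return dict(groups)


def insert_grouped(db, reports, type_field="report_type"):
    """Insert each group into the collection named after its report type.

    Returns a dict mapping collection name to the number of inserted documents.
    """
    counts = {}
    for name, docs in group_by_type(reports, type_field).items():
        result = db[name].insert_many(docs, ordered=False)
        counts[name] = len(result.inserted_ids)
    return counts


# Example usage (assumes pymongo and a hypothetical database name):
# from pymongo import MongoClient
# db = MongoClient("mongodb://localhost:27017")["mydb"]
# insert_grouped(db, [{"report_type": "report_type_1", "value": 1}])
```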
