MongoDB – the best way to reload data in MongoDB

mongodb, optimization

I am benchmarking MongoDB with the YCSB framework, using the latest drivers.

During the benchmark I change the cluster setup extensively (adding/removing shards) and regularly delete and reinsert workloads on the order of 5 GB. To make this process quicker I have tried two approaches.

In the first approach I use the db.collection.remove() function to delete the documents, which takes around 20 minutes to remove 5 GB of documents from one shard. This method preserves the indexes, which results in higher insertion throughput afterwards.
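
For reference, the remove-based reset is just the following shell call (the collection name usertable is an assumption based on the YCSB default):

    // Delete every document with an empty filter; the collection itself,
    // its indexes, and its sharding metadata stay in place.
    db.usertable.remove({})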

In the second approach I use the db.collection.drop() function, which removes the data almost instantly. However, insertion then takes much longer because the data has to be balanced across the shards again.
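
The drop-based reset, for comparison (again assuming the YCSB default collection name):

    // Drop the collection outright: data, indexes, and, on a sharded cluster,
    // the chunk metadata all go with it, so the collection has to be sharded
    // again before the next load.
    db.usertable.drop()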

Is there a better approach for these tasks? If not, which of the two is recommended?

Best Answer

I'd recommend looking into pre-splitting and/or using a hashed shard key for the insertion, and sticking with dropping the collection (with remove you are essentially issuing a delete for every document, so it will always be slow). The hashed shard key is usually the easiest one to get started with.
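
As a minimal sketch, assuming the YCSB defaults of a ycsb database and a usertable collection keyed on _id, sharding on a hashed key looks like this:

    // Enable sharding for the database, then shard the collection on the hash
    // of _id so inserts are spread across all shards immediately instead of
    // piling up in a single chunk.
    sh.enableSharding("ycsb")
    sh.shardCollection("ycsb.usertable", { _id: "hashed" })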

If you are looking to measure write throughput, then either of those approaches (when done properly) will let you utilize all available shards immediately, rather than hitting one shard first and waiting for the balancer to move data around. Also note that if you use a monotonically increasing shard key, you will only ever write to a single shard; the "three pitfalls" posts give good descriptions of all of that.
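
If you prefer a ranged shard key, manual pre-splitting is the alternative; a rough sketch follows, where the split points are hypothetical and should match the key range your YCSB workload will actually generate:

    // Shard on the plain _id and pre-split the empty key space so every shard
    // owns chunks before the load starts; the balancer moves the empty chunks
    // up front instead of migrating data mid-benchmark.
    sh.shardCollection("ycsb.usertable", { _id: 1 })
    sh.splitAt("ycsb.usertable", { _id: "user2500000000" })
    sh.splitAt("ycsb.usertable", { _id: "user5000000000" })
    sh.splitAt("ycsb.usertable", { _id: "user7500000000" })

You can then check sh.status() to confirm the chunks are spread across the shards before starting the load.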