MongoDB – Creating TTL Index with Rolling Method

I'm stuck archiving a huge amount of data in MongoDB 3.6.

I want to delete 506 million records from a collection. I tried bulk.remove(), but that is also slow: only 50 records are removed per second.

I read somewhere that a TTL index scans for expired documents every hour, so it would remove the data faster.

But if I create this index in the foreground, it will lock the collection. So I'm thinking of using the rolling index build method.

Suppose I do that on a 3-node replica set: I detach node3 and create the index on it. Once the index is created, it will automatically start removing data. When I add the node back to the replica set and create the index on the primary, the primary will run the deletes and try to replicate them. In the worst case, the data has already been removed on that node. Will that break replication?

Best Answer

Can I create a TTL index with the rolling method?

Yes, this is a supported approach for building indexes on replica sets. However, if your goal is to efficiently remove a large quantity of existing documents there are some caveats to be aware of as noted below.

I tried to remove using bulk.remove(), but that is also slow (50 records are removed per second).

A TTL index will not speed up removal of documents if you already have an index that supports finding expired documents: the TTL thread still needs to find & remove matching documents so will be doing similar work to a bulk remove.

I would investigate why your current bulk remove operations are slow. For example, make sure you have an optimal index in place to find documents to remove and monitor your system resources (memory, I/O, network, ...) to ensure there aren't any obvious bottlenecks.

If you have a large number of documents that are ready to be removed when the TTL index is created, this could have a significant performance impact. Bulk remove queries with a supporting index would allow more control over the impact since you can add query criteria to restrict the range of documents matching each bulk deletion.
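As a rough illustration of restricting each bulk deletion to a range, here is a minimal sketch written for the synchronous legacy mongo shell. The helper names (dateWindows, deleteInWindows), the collection handle, and the date field are all hypothetical; it assumes an index already exists on that date field and that the collection exposes the shell's deleteMany().

```javascript
// Sketch only: helper names and the date field are assumptions, not from the original post.

// Split the half-open range [start, end) into consecutive windows of windowMs milliseconds.
function dateWindows(start, end, windowMs) {
  if (!(windowMs > 0)) {
    throw new Error("windowMs must be a positive number");
  }
  var windows = [];
  for (var t = start.getTime(); t < end.getTime(); t += windowMs) {
    windows.push({
      from: new Date(t),
      to: new Date(Math.min(t + windowMs, end.getTime()))
    });
  }
  return windows;
}

// Delete documents whose `field` falls in [start, cutoff), one window at a time.
// `coll` is a collection handle such as db.events (hypothetical name);
// `field` is assumed to be an indexed date field.
function deleteInWindows(coll, field, start, cutoff, windowMs) {
  var total = 0;
  dateWindows(start, cutoff, windowMs).forEach(function (w) {
    var filter = {};
    filter[field] = { $gte: w.from, $lt: w.to };
    // Each call only touches one bounded range of the index.
    var result = coll.deleteMany(filter);
    total += result.deletedCount;
  });
  return total;
}
```

Smaller windows mean smaller individual deletes, which makes it easier to pause between batches or stop if replication lag or I/O pressure builds up.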

I read somewhere that a TTL index scans for expired documents every hour, so it would remove the data faster.

That timing is incorrect: the TTL deletion task runs every 60 seconds. Based on an indexed date field the TTL monitor can either expire documents after a specified number of seconds has passed or expire documents at a specific clock time.
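The two expiry modes can be sketched as follows, again in legacy mongo shell style. The function names and the field names used with them are hypothetical; createIndex() with the expireAfterSeconds option is the standard way to declare a TTL index. The isExpired() helper is only an approximation of the rule the TTL monitor applies, included to make the semantics concrete, and it ignores details such as exact boundary handling and array values. Note that these calls do not choose a build mode, so on a pre-4.2 server they would run as foreground builds.

```javascript
// Sketch only: function and field names are assumptions for illustration.

// Mode 1: expire each document a fixed number of seconds after the value of its date field.
function createAgeBasedTtl(coll, field, seconds) {
  var keys = {};
  keys[field] = 1;
  return coll.createIndex(keys, { expireAfterSeconds: seconds });
}

// Mode 2: expire each document at the clock time stored in its date field.
// Setting expireAfterSeconds to 0 means "expire as soon as that time is reached".
function createClockBasedTtl(coll, field) {
  var keys = {};
  keys[field] = 1;
  return coll.createIndex(keys, { expireAfterSeconds: 0 });
}

// Approximate eligibility rule: a document qualifies once
// (field value + seconds) is no later than the current time.
// Documents whose field is not a Date are never expired.
function isExpired(doc, field, seconds, now) {
  var value = doc[field];
  if (!(value instanceof Date)) {
    return false;
  }
  return value.getTime() + seconds * 1000 <= now.getTime();
}
```

For example, createAgeBasedTtl(db.events, "createdAt", 86400) would ask the server to expire documents roughly one day after their createdAt value, where db.events and createdAt are placeholder names.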

Assuming your documents have a range of expiry dates, once the initial removal of expired documents is complete a TTL index will be able to delete documents in smaller batches which will be less impactful than an infrequent bulk delete.

But if I create this index in the foreground, it will lock the collection. So I'm thinking of using the rolling index build method.

Prior to MongoDB 4.2, a foreground index build on a populated collection will block all other operations on the database that holds that collection. For a populated collection in a production environment you will definitely want to use either a rolling index build or a background index build. The rolling index build ensures that only one of your replica set members is building an index at a time and allows a foreground index build to complete faster. However, this approach does carry some risk of that member becoming stale while it is running in standalone mode.

MongoDB 4.2+ uses an optimised index build process that limits the lock scope to the affected collection and only holds an exclusive lock at the beginning and end of the index build. You can still use the rolling index build approach but there is no longer a foreground vs background index build distinction.

Suppose I do that on a 3-node replica set: I detach node3 and create the index on it. Once the index is created, it will automatically start removing data.

The TTL index thread on replica set members only deletes documents when a member is in the primary state. Document deletes are replicated via the oplog, so secondaries always remain consistent with the current primary as of a given point in time.

If you restart a replica set member in standalone mode, the TTL collection monitor will not be started (again, to keep the secondary state consistent).
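One way to observe this behaviour is to compare TTL counters over time. The sketch below assumes the legacy mongo shell, where db.serverStatus().metrics.ttl reports passes and deletedDocuments and db.isMaster().ismaster reports whether the node accepts writes. The helper names are hypothetical, and the Number() conversion assumes the shell's NumberLong values convert cleanly. Keep in mind that ismaster is also true on a standalone, so it only distinguishes primary from secondary within a replica set.

```javascript
// Sketch only: helper names are assumptions for illustration.

// Capture the TTL monitor counters and the node's write state.
// `db` is the shell's database handle.
function ttlSnapshot(db) {
  var ttl = db.serverStatus().metrics.ttl;
  return {
    isPrimary: db.isMaster().ismaster === true,
    // Counters may be NumberLong in the shell; Number() is assumed to convert them.
    passes: Number(ttl.passes),
    deletedDocuments: Number(ttl.deletedDocuments)
  };
}

// Compare two snapshots taken some time apart (at least a minute or two,
// given the 60-second TTL interval mentioned above).
function ttlProgress(before, after) {
  return {
    passes: after.passes - before.passes,
    deletedDocuments: after.deletedDocuments - before.deletedDocuments,
    ttlActive: after.passes > before.passes
  };
}
```

If the answer above holds, two snapshots taken on a secondary or on a member restarted in standalone mode should show no change in deletedDocuments from local TTL activity, while the primary's counters should grow as expired documents are removed.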