MongoDB Monthly Data Archival – Best Practices

Tags: archive, mongodb, partitioning

Coming from an RDBMS background, this is what I have done in the past:

  1. Create a table with partitions based on a date field. There are typically three tables (partitions) – current (M0), previous month (M1) and previous to previous month (M2).
  2. Detach M2, download data and drop table. Create M2 (empty) again.

How do we achieve the same with MongoDB? I have read about capped collections and the TTL setting, but they don't seem to fit the bill. There is ranged sharding too, but we aren't a large enough installation to need a clustered deployment.

The aim is to have a 'small' data set so that queries run efficiently and API invocations (depending on these queries) do not incur latency.

Best Answer

As of MongoDB 4.0, there is no built-in support for partitioned collections, but you can implement a similar pattern:

  • Your application would always write new entries to collection M0 (current)
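  • You would set up a scheduled task to rotate collection names at the end of the month using renameCollection (renaming M1 to M2 first, then M0 to M1). The previous M2 must already be gone at that point, since renameCollection fails if the target name exists unless dropTarget is set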
  • Your scheduled task would download the data from M2 and drop this collection when no longer needed
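The rotation described above could be sketched roughly as follows. This is only an illustration, not a tested production job: it assumes a pymongo-style database handle, the collection names M0, M1 and M2 from the question, and a hypothetical `export` callback that downloads a collection's data. It also assumes the old M2 is exported and dropped immediately before the rename, which is one possible ordering.

```python
def rotate_monthly(db, export):
    """Rotate M0 -> M1 -> M2 on a pymongo-style Database handle.

    `export` is a hypothetical callback that receives the old M2
    collection and is expected to download its data somewhere safe.
    """
    existing = set(db.list_collection_names())

    # Archive and drop the oldest month first so the name M2 is free.
    if "M2" in existing:
        export(db["M2"])
        db.drop_collection("M2")

    # Order matters: free up each name before renaming into it.
    if "M1" in existing:
        db["M1"].rename("M2")
    if "M0" in existing:
        db["M0"].rename("M1")

    # M0 is not recreated here: MongoDB creates it again on the
    # next insert from the application.
```

Indexes travel with a renamed collection, so the fresh M0 starts without any secondary indexes; the scheduled task (or the application) would need to create them again.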

This approach should minimise storage fragmentation, since your M0 collection is recreated every month rather than being long-lived. However, it would limit your ability to easily run simple queries against data spanning both M0 and M1.
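One way to work around the spanning-query limitation is to run the same filter against each monthly collection and combine the results in the application. A minimal sketch, assuming a pymongo-style database handle and a hypothetical helper name `find_recent`:

```python
from itertools import chain


def find_recent(db, query, names=("M0", "M1")):
    """Run the same filter on each monthly collection and chain the results.

    Results come back grouped by collection (M0 first, then M1), so any
    sorting or limiting across both months has to happen in the caller.
    """
    return chain.from_iterable(db[name].find(query) for name in names)
```

For example, `list(find_recent(db, {"userId": 42}))` would return matching documents from both months, where `userId` is just a placeholder field.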

> I have read about capped collections and the TTL setting, but they don't seem to fit the bill. There is ranged sharding too, but we aren't a large enough installation to need a clustered deployment.

Capped collections have a number of limitations and aren't appropriate for this use case. Capped collection growth is limited by size in bytes or number of documents, but you cannot control document expiry time. You also cannot directly delete documents in a capped collection as they are designed for FIFO (First In, First Out) use cases.

A TTL (Time-To-Live) index is a definite candidate for your use case. Documents expire either a fixed number of seconds after the value of an indexed date field, or at a specific clock time stored in that field. You could use the first type of expiry to smooth out deletions over time (so documents expire after ~62 days), or the second type to set a stricter date based on the first day of the month when the documents would normally be dropped as part of M2.
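Both TTL variants could look something like the sketch below. It assumes a pymongo-style collection handle; the field names `createdAt` and `expireAt`, the helper names, and the two-month retention default are assumptions made for illustration. You would normally pick one variant rather than create both indexes.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 24 * 60 * 60


def create_rolling_ttl(collection, days=62):
    # Variant 1: each document expires `days` after its own createdAt value.
    return collection.create_index(
        [("createdAt", 1)], expireAfterSeconds=days * SECONDS_PER_DAY
    )


def create_fixed_ttl(collection):
    # Variant 2: each document expires at the clock time stored in expireAt.
    return collection.create_index([("expireAt", 1)], expireAfterSeconds=0)


def month_boundary_expiry(created, months_kept=2):
    """First day of the month `months_kept` months after `created`.

    With months_kept=2, a document is kept for its own month and the
    following one, mirroring the M0 and M1 partitions.
    """
    month_index = created.year * 12 + (created.month - 1) + months_kept
    year, month0 = divmod(month_index, 12)
    return datetime(year, month0 + 1, 1, tzinfo=created.tzinfo)


# Example document for variant 2 (values must be stored as dates).
now = datetime.now(timezone.utc)
doc = {"createdAt": now, "expireAt": month_boundary_expiry(now)}
```

Keep in mind that a TTL index only deletes documents; if the data still has to be downloaded for archival, that export needs to happen before the expiry time is reached.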

Sharding adds a lot of operational complexity if your only use case is removal of archival data. The relevant feature to consider is zone-based sharding which would allow you to associate contiguous ranges of a shard key with specific shards. This is more typical for an archival use case where older data should remain online but move to lower cost/performance storage. For more information see Tiered hardware for varying SLA or SLO in the MongoDB documentation.