Coming from an RDBMS background, this is what I have done in the past.
- Create a table with partitions based on a date field. There are typically three tables (partitions) – current (M0), previous month (M1) and previous to previous month (M2).
- Detach M2, download data and drop table. Create M2 (empty) again.
How do we achieve the same with MongoDB? I have read about Capped Collections and the TTL setting, but they don't seem to fit the bill. There is ranged sharding too, but ours isn't a large enough installation to require a clustered deployment.
The aim is to have a 'small' data set so that queries run efficiently and API invocations (depending on these queries) do not incur latency.
Best Answer
As at MongoDB 4.0, there isn't built-in support for partitioned collections but you can implement a similar pattern:
- Write new data to the M0 (current) collection.
- At the end of each month, use renameCollection (renaming M1 to M2 and M0 to M1) to rotate the collections.
- Download the data from M2 and drop this collection when no longer needed.
This approach should minimise storage fragmentation since your M0 collection is recreated every month rather than being long-lived. However, it would limit your ability to easily run simple queries against data spanning both M0 and M1.
Capped collections have a number of limitations and aren't appropriate for this use case. Capped collection growth is limited by size in bytes or number of documents, but you cannot control document expiry time. You also cannot directly delete documents in a capped collection as they are designed for FIFO (First In, First Out) use cases.
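The monthly rename pattern described above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: it assumes `db` is a pymongo `Database` handle, that the collections are literally named M0, M1 and M2, and that the old M2 has already been downloaded before the rotation runs (it is dropped here). The function name is hypothetical.

```python
def rotate_monthly_collections(db, current="M0", previous="M1", oldest="M2"):
    """Rotate the monthly collections: drop M2, M1 -> M2, M0 -> M1, new M0.

    Assumption: `db` is a pymongo Database, and `oldest` has already been
    archived, because this function drops it.
    """
    existing = set(db.list_collection_names())

    # Remove last cycle's oldest collection so the rename target is free.
    if oldest in existing:
        db.drop_collection(oldest)

    # Shift each collection one slot towards the oldest position.
    if previous in existing:
        db[previous].rename(oldest)
    if current in existing:
        db[current].rename(previous)

    # A renamed collection keeps its indexes, but the fresh `current`
    # collection starts with only the default _id index, so any other
    # indexes the queries rely on need to be recreated after this call.
    db.create_collection(current)
```

A scheduled monthly job could call this once, after the export of M2 has finished. Writers that hold no reference to a specific collection object simply continue inserting into M0 by name.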
A TTL (Time-To-Live) index is a definite candidate for your use case. Documents have a TTL expiry date based on a number of seconds or a specific clock time. You could use the first type of expiry to smooth out deletions over time (so documents expire after ~62 days) or the second type of expiry to set a stricter date based on the first day of the month when the documents would normally be dropped as part of M2.
Sharding adds a lot of operational complexity if your only use case is removal of archival data. The relevant feature to consider is zone-based sharding, which would allow you to associate contiguous ranges of a shard key with specific shards. This is more typical for an archival use case where older data should remain online but move to lower cost/performance storage. For more information see "Tiered hardware for varying SLA or SLO" in the MongoDB documentation.
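For the TTL option discussed above, the two kinds of expiry might look something like the sketch below. It assumes `collection` is a pymongo `Collection`, and the field names `createdAt` and `expireAt` are hypothetical; both must hold date values for the TTL index to act on them. How many months ahead the fixed expiry date should fall depends on when M2 would have been dropped in your rotation, so `months` is left as an explicit parameter.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 24 * 60 * 60


def first_of_month_after(created, months):
    """Return midnight UTC on the first day of the month that is
    `months` calendar months after the month of `created`."""
    index = created.year * 12 + (created.month - 1) + months
    return datetime(index // 12, index % 12 + 1, 1, tzinfo=timezone.utc)


def create_rolling_ttl_index(collection, field="createdAt", days=62):
    """First type: expire each document `days` days after the date in `field`."""
    return collection.create_index(field, expireAfterSeconds=days * SECONDS_PER_DAY)


def create_fixed_date_ttl_index(collection, field="expireAt"):
    """Second type: expire each document at the clock time stored in `field`."""
    return collection.create_index(field, expireAfterSeconds=0)
```

With the second type, each inserted document would carry its own expiry, for example `{"expireAt": first_of_month_after(now, 2)}`. TTL deletion is done by a periodic background task, so documents disappear shortly after their expiry time rather than at the exact instant.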