MongoDB Monthly Data Archival – Best Practices

Tags: archive, mongodb, partitioning

Coming from an RDBMS background, this is what I have done in the past:

Create a table with partitions based on a date field. There are typically three tables (partitions): current (M0), previous month (M1) and the month before that (M2).
Detach M2, download its data and drop the table. Then create an empty M2 again.

How do we achieve the same with MongoDB? I have read about capped collections and the TTL setting, but they don't seem to fit the bill. There is ranged sharding too, but our installation isn't large enough to require a clustered deployment.

The aim is to keep the data set 'small' so that queries run efficiently and the API calls that depend on them do not incur extra latency.

Best Answer

As at MongoDB 4.0, there isn't built-in support for partitioned collections but you can implement a similar pattern:

Your application would always write new entries to collection M0 (current)
You would set up a scheduled task to rotate collection names at the end of the month using renameCollection (renaming M1 to M2 and M0 to M1)
Your scheduled task would download the data from M2 and drop this collection when no longer needed
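
The three steps above could be sketched roughly as follows with Python and PyMongo. This is a minimal sketch, not a definitive implementation: it assumes `db` is a PyMongo `Database`, that the collections are literally named M0, M1 and M2, and the helper names (`rotation_plan`, `rotate`, `archive_and_drop`) are made up for illustration. The JSON-lines export is a lossy stand-in; mongodump or mongoexport would normally be a better way to download the data.

```python
import json


def rotation_plan(existing):
    """Return the ordered (source, target) renames for a month-end rotation."""
    if "M2" in existing:
        # Refuse to overwrite data that has not been archived yet.
        raise RuntimeError("M2 still exists: archive and drop it before rotating")
    plan = []
    if "M1" in existing:
        plan.append(("M1", "M2"))  # must happen first to free up the name M1
    if "M0" in existing:
        plan.append(("M0", "M1"))
    return plan


def rotate(db):
    """Apply the plan; `db` is assumed to be a pymongo Database."""
    for source, target in rotation_plan(set(db.list_collection_names())):
        db[source].rename(target)  # issues the renameCollection command
    # M0 is recreated implicitly by the next insert. Indexes move with the
    # renamed collection, so any indexes M0 needs must be created again.


def archive_and_drop(db, out):
    """Write every M2 document to the text stream `out` as JSON lines, then drop M2."""
    count = 0
    for doc in db["M2"].find():
        # default=str flattens ObjectId and datetime values to strings (lossy).
        out.write(json.dumps(doc, default=str) + "\n")
        count += 1
    db.drop_collection("M2")
    return count
```

A scheduler such as cron would call `rotate(db)` once at the month boundary and `archive_and_drop(db, fh)` once M2 is no longer needed.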

This approach should minimise storage fragmentation since your M0 collection is recreated every month rather than being long-lived. However, it would limit your ability to run simple queries against data spanning both M0 and M1.
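
One way to work around the spanning-query limitation is to run the same query against each collection and merge the results in the application. A minimal sketch, assuming PyMongo `Collection` objects, that every matching document contains the sort field, and a hypothetical helper name `find_across`:

```python
import heapq
from operator import itemgetter


def find_across(collections, query, sort_field):
    """Yield matching documents from all collections in ascending sort_field order."""
    # Each cursor is sorted by the server; heapq.merge then interleaves the
    # already-sorted streams lazily instead of loading everything into memory.
    cursors = [coll.find(query).sort(sort_field, 1) for coll in collections]
    return heapq.merge(*cursors, key=itemgetter(sort_field))
```

It might be called as `find_across([db["M1"], db["M0"]], {"status": "open"}, "createdAt")`, where the filter and field name are placeholders for whatever your documents actually contain.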

"I have read about Capped Collections and TTL setting. Doesn't seem to fit the bill. There is ranged sharding too; but, we aren't such a huge installation requiring a clustered deployment."

Capped collections have a number of limitations and aren't appropriate for this use case. Capped collection growth is limited by size in bytes or number of documents, but you cannot control document expiry time. You also cannot directly delete documents in a capped collection as they are designed for FIFO (First In, First Out) use cases.

A TTL (Time-To-Live) index is a definite candidate for your use case. A TTL index expires each document either a fixed number of seconds after the date in an indexed field, or at a specific clock time stored in that field. You could use the first type of expiry to smooth out deletions over time (so documents expire after ~62 days) or the second type to set a stricter date based on the first day of the month when the documents would normally be dropped as part of M2.
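
Both kinds of TTL expiry could look roughly like this with PyMongo. This is a sketch under assumptions: the field names `createdAt` and `expireAt`, the helper names, and the number of months to keep are all illustrative, and `collection` is assumed to be a PyMongo `Collection`. You would normally pick one of the two index variants. The date arithmetic is plain Python and runs without a server.

```python
from datetime import datetime

SIXTY_TWO_DAYS = 62 * 24 * 60 * 60  # seconds


def first_of_month_after(created, months):
    """Return midnight on the first day of the month `months` months after `created`."""
    index = created.year * 12 + (created.month - 1) + months
    return datetime(index // 12, index % 12 + 1, 1, tzinfo=created.tzinfo)


def with_expiry(doc, months):
    """Copy `doc`, adding an expireAt date derived from its createdAt field."""
    return {**doc, "expireAt": first_of_month_after(doc["createdAt"], months)}


def ensure_sliding_ttl(collection):
    # First type: each document expires about 62 days after its createdAt value.
    collection.create_index("createdAt", expireAfterSeconds=SIXTY_TWO_DAYS)


def ensure_fixed_ttl(collection):
    # Second type: expireAfterSeconds=0 means "expire at the time stored in expireAt".
    collection.create_index("expireAt", expireAfterSeconds=0)
```

For example, with `months=3` a document created on 15 January would get an `expireAt` of 1 April. The TTL monitor runs periodically in the background (every 60 seconds by default), so documents are removed shortly after their expiry time rather than exactly at it.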

Sharding adds a lot of operational complexity if your only use case is removal of archival data. The relevant feature to consider is zone-based sharding which would allow you to associate contiguous ranges of a shard key with specific shards. This is more typical for an archival use case where older data should remain online but move to lower cost/performance storage. For more information, see "Tiered hardware for varying SLA or SLO" in the MongoDB documentation.

Related Solutions

Data archival strategy that handles schema changes

You need to define your requirements more specifically before even thinking about a solution:

Why is archiving necessary? It sounds like the system already handles old data, so what is the business need to separate out this data? Performance?
Is archive data a read-only snapshot, or are historical data changes possible? If changes are possible, which types of changes will be supported (insert, update, delete, or some combination of those)?
Is the database multi-tenant, and if so, do you require the ability for each tenant to be able to archive at different points in time?
Does your application need to run with the archive data as a data source? I'm assuming yes since you mention synchronizing schema changes.
What is the minimum version/edition of the DBMS you need to support? This will determine which features are available to use in your strategy.
How much time do you have to implement archiving? Archiving is a very low-level design issue that ideally should be baked in right from the start; adding it on later can potentially take a significant amount of time to redesign and implement.

Now having said all that, one thing I would advise you, having done it here (SQL Server), is this: avoid multiple databases if at all possible, particularly if you require the ability to edit historical data.

While I can't give away our IP, I'll tell you that the process of shuffling data from the "live" database to the "archive" database is extremely complex if you take a dynamic approach like we did for ~700 tables. Depending on the state of your database schema, that kind of thing may not even be possible to accomplish, or may result in data discrepancies you don't expect. If you don't have very many tables (< 200) and the schema is in rough shape, quite honestly, do not take a dynamic approach, or wait until it's cleaned up to a decent state. With a large number of tables and a rough schema, multiple databases is not a viable solution.
As you mentioned, you do have to run update scripts against multiple databases, and you have to keep track of all the databases somehow. It's really easy for things to become desynchronized: for example, an archive database is moved to a different server or instance while your configuration database or table still holds the old information. The choice is either to always synchronize the schema, or to write applications that are always backwards compatible. (Hint: it's far easier to synchronize the schema.)

While I certainly don't recommend multiple databases, it is a possible solution depending on your requirements.

MongoDB store weekly data

Have you looked at having Historic Players and Historic Teams collections?

Then if you wanted stats about the players at the time, you'd probably want to copy the current players on the team into your historic games documents. You'd probably want to store that as an array in the Historic Games collection, as the current players list might be different from the players from last week or last year.

Keep the position of the players in the document in mind so you can end up using proper indexes.

historic players:
{
  _id: "jfoi23j8xx",
  PlayerID: "Bo Jackson",
  weekinfo: [   // array (or tags)
    { week: 1, hits: 30, misses: 3, bunts: 7, TeamID: "LA Raiders" },
    { week: 2, hits: 17, misses: 2, bunts: 3, TeamID: "Kansas City Royals" }
  ]
}

historic teams:
{
  _id: "10u40o321js",
  TeamID: "LA Raiders",
  weekinfo: [   // array (or tags)
    { week: 1, "consolidated hits": 73, misses: 12 },
    { week: 2, "consolidated hits": 65, misses: 16 }
  ]
}

Something like that might be a good start. Does it meet your requirements? You might want to have your app handle the aggregation but that's up to you.
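
With that layout, recording a new week could be a single upsert that appends to the weekinfo array. A minimal sketch, assuming PyMongo and a hypothetical helper name `weekly_stats_update`; the field names follow the layout above:

```python
def weekly_stats_update(player_id, week, team_id, hits, misses, bunts):
    """Build the (filter, update) pair for an upserting update_one call."""
    entry = {
        "week": week,
        "hits": hits,
        "misses": misses,
        "bunts": bunts,
        "TeamID": team_id,
    }
    # $push appends to the array, creating it if the document is new.
    return {"PlayerID": player_id}, {"$push": {"weekinfo": entry}}
```

It would be used as `players.update_one(*weekly_stats_update("Bo Jackson", 1, "LA Raiders", 30, 3, 7), upsert=True)`, where `players` stands for the historic players collection. If you query by team, a multikey index on `weekinfo.TeamID` is one option to consider.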
