MongoDB: move documents before capping

backup, mongodb

The capped collection concept works well for most of my projects, where cleaning out old data without further care is acceptable.

For other projects, I need a more complex and safer concept. The requirement is nearly the same as for logrotate.
The data is appended to the main collection, without compression/compaction and with no index except a timestamp for simple queries by time. => The focus is on writing and on keeping the data persistent.

Similar to the logrotate rules, I'd like the main collection not to become too large => capped by size; if possible, capping by timestamp would be a plus.

This sounds like a capped collection, but I do not want any data loss when it is capped. The old data should be stored in a collection of another database; that collection must be non-capped and compacted (compact: true). Its name depends on the current month, which ensures there will be at most 12 "archive" collections per year.

Example:

liveDB.mainCollection_capped grows and starts capping.

Before old documents are removed, they are safely moved into archiveDB.compactArchiveCollection201303.

No data is lost and the main collection remains small and fast. Storing the data in another database avoids database locks, e.g. repairDatabase tasks on the archive will not affect or delay the main collection.

Is there a good practice for achieving this, as reliably and automated as possible, without having to write the whole data transfer as a cron job myself? Such a job must never be missed, because data is lost if capping starts before the old documents have been copied into the archive.

Best Answer

For the first part, expiring based on a timestamp, you will want to check out TTL collections (requires version 2.2+). They won't cap the size like a capped collection, but they let you set the constraint via time, as long as you have a BSON Date type field to index on.
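
As a rough illustration (assuming pymongo, a field called `ts` holding a BSON Date, and a 30-day retention period; collection names are taken from the question), a TTL index could be created like this:

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient()                 # assumes a local mongod
coll = client.liveDB.mainCollection    # the "live" collection from the question

# Documents are removed roughly 30 days after the value in their "ts" field.
# "ts" must be a BSON Date for the TTL monitor to consider the document.
coll.create_index("ts", expireAfterSeconds=30 * 24 * 3600)

coll.insert_one({"msg": "example entry", "ts": datetime.now(timezone.utc)})
```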

For the second part, automatically storing the documents in a new database before expiry, I think what you need is a periodic job (via cron, at, or similar, in the language of your choice) that runs and grabs those documents before they expire. This should just be a simple range-based query over time: find the appropriate docs and insert them into the new DB, always at enough of an offset to avoid missing documents.
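
A minimal sketch of such a job, assuming pymongo, the `ts` field and 30-day TTL from above, and the database/collection names from the question; the per-month archive name is derived from the current date, and the upsert keyed on `_id` makes re-runs safe:

```python
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

client = MongoClient()
live = client.liveDB.mainCollection
archive_db = client.archiveDB

def archive_old_documents(ttl_days=30, safety_days=2):
    """Copy documents that will expire soon into a per-month archive collection.

    Run this from cron well before the TTL deadline; safety_days is the offset
    that keeps the job ahead of the TTL monitor.
    """
    now = datetime.now(timezone.utc)
    cutoff = now - timedelta(days=ttl_days - safety_days)
    archive = archive_db["compactArchiveCollection" + now.strftime("%Y%m")]

    for doc in live.find({"ts": {"$lt": cutoff}}).sort("ts", 1):
        # Upsert on _id so the job can be re-run without duplicating documents.
        archive.replace_one({"_id": doc["_id"]}, doc, upsert=True)

if __name__ == "__main__":
    archive_old_documents()
```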

The other approach I can think of would be to tail the oplog, look for deletes on that collection, and then transcode them into inserts on a different DB.
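
A rough sketch of the tailing part with a pymongo tailable cursor (assumes a replica set, since the oplog only exists there). Note that a delete entry in the oplog carries only the _id of the removed document, so the full document must already have been copied, for example by the periodic job above:

```python
from pymongo import CursorType, MongoClient

client = MongoClient()          # must be connected to a replica-set member
oplog = client.local.oplog.rs

# Start tailing from the newest existing oplog entry.
last = oplog.find().sort("$natural", -1).limit(1).next()
start_ts = last["ts"]

cursor = oplog.find(
    {"ts": {"$gt": start_ts}, "op": "d", "ns": "liveDB.mainCollection"},
    cursor_type=CursorType.TAILABLE_AWAIT,
)

while cursor.alive:
    for entry in cursor:
        # The delete entry contains only the _id of the removed document.
        removed_id = entry["o"]["_id"]
        print("document removed from live collection:", removed_id)
```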

For extra safety you could run a secondary with slaveDelay to give yourself an extra window if one of the approaches above fails.
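
For illustration, a delayed secondary could be configured roughly like this (a sketch assuming a three-member replica set, pymongo, and a MongoDB version that still uses the slaveDelay field and supports the replSetGetConfig command; which member to delay is an assumption):

```python
from pymongo import MongoClient

client = MongoClient()                      # connect to the current primary

conf = client.admin.command("replSetGetConfig")["config"]
member = conf["members"][2]                 # the member that should lag behind
member["priority"] = 0                      # a delayed member must not become primary
member["hidden"] = True                     # keep application reads away from it
member["slaveDelay"] = 3600                 # stay one hour behind the primary
conf["version"] += 1

client.admin.command("replSetReconfig", conf)
```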

I don't know of anything that will do this by default, but I believe either of the approaches above should work.