MongoDB – Need a solution to cheaply store a big MongoDB collection


We have the typical problem of Twitter's timeline tweets or a Facebook profile's news feed:
we need to show users only the new information (keeping it in a fast db), and put all the old information somewhere else (while still being able to load it on request in a reasonable time).

The reasons for moving the old data are both economic (we don't need, and don't want, tens of GB of rarely accessed data taking up space on a fast production db) and performance-related (tens of GB of extra data adds a lot of overhead to indexes and queries).

Right now we have all the information, recent and old, in a MongoDB instance hosted on MongoHQ. Everything currently lives in its own dedicated database, which contains only one big collection.
I would like to keep only the most recent data in the fast production db, and move the older data (say, anything older than one month) to another database or another kind of storage. I would set up a nightly job to do the move (see the sketch below), so that only recent data stays on the MongoDB.
I don't want to manage the database myself right now, so I was looking for a hosted solution.
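
As a minimal sketch of what that nightly job could look like, assuming pymongo and a hypothetical `tweets` collection with a `created_at` field (all connection strings, database names, and collection names below are placeholders):

```python
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

# Hypothetical connection strings, database and collection names.
live = MongoClient("mongodb://live-host")["production"]["tweets"]
archive = MongoClient("mongodb://archive-host")["archive"]["tweets"]

cutoff = datetime.now(timezone.utc) - timedelta(days=30)

# Copy documents older than the cutoff in batches, deleting each batch
# from the live collection only after the copy has succeeded.
while True:
    batch = list(live.find({"created_at": {"$lt": cutoff}}).limit(1000))
    if not batch:
        break
    archive.insert_many(batch, ordered=False)
    live.delete_many({"_id": {"$in": [doc["_id"] for doc in batch]}})
```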

I think we have two options for storing the old data:

  1. storing the data in another cloud-hosted db that is cheaper than the Mongo hosting we currently use for the primary db, such as Amazon's DynamoDB or Google BigQuery. Going this way would mean transforming our MongoDB collection into another format on the way out (see the export sketch after this list), and transforming it back whenever we need to read it again
  2. setting up "our own" server, for example by buying a package for MongoDB hosted on EC2. Going this way simply means adding another db to our infrastructure, but it is quite an expensive solution, as those services provide high-performance databases and we don't need that.
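
For option 1, a rough sketch of the outbound transformation, assuming the target ingests newline-delimited JSON (as Google BigQuery does). pymongo's `bson.json_util` round-trips Mongo-specific types such as ObjectId and dates, so `json_util.loads` can reverse the process later; the names below are hypothetical:

```python
from datetime import datetime, timedelta, timezone
from bson import json_util
from pymongo import MongoClient

live = MongoClient("mongodb://live-host")["production"]["tweets"]
cutoff = datetime.now(timezone.utc) - timedelta(days=30)

# One MongoDB Extended JSON document per line; plain json.dumps would
# choke on ObjectId and datetime values, json_util handles them.
with open("tweets-archive.jsonl", "w") as out:
    for doc in live.find({"created_at": {"$lt": cutoff}}):
        out.write(json_util.dumps(doc) + "\n")
```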

I think that (2) is the easier solution to implement, but (1) would be much cheaper, given that the data will be rarely accessed and the new db would be used mostly as a storage facility.
In the coming months we will definitely go for (2): if data and usage increase, I think the best option will be to manage most things ourselves instead of relying on PaaS or SaaS hosting. For now, though, that solution seems quite expensive.

Do you have any suggestion for a situation like this? Is there a third way that I haven't considered?

Best Answer

Short answer - this is a tricky proposition, and your available options really depend on how much effort you wish to put in versus the cost of your option 2 (which I will talk about below). Having never done it, I can't speak to the ease (or lack thereof) of using another cloud-based solution for storing the data elsewhere, but I can comment on some of the options on the MongoDB side.

First up, in terms of putting in the effort, a slight twist on your two proposals:

  1. Create a less expensive instance in the cloud
  2. Implement aging/removal on the live database (either manually, with TTL indexes, or with capped collections)
  3. Use a tailable cursor to pull all operations out of your oplog and insert them into your new instance(s) (see the sketch after this list). Alternatively, you could look at the Mongo Connector for this functionality to avoid writing it yourself, but it will not work out of the box with TTL (TTL deletes are replicated from the primary and would need to be filtered out manually)
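
A minimal sketch of step 3 in pymongo, assuming the live deployment is a replica set (the oplog only exists on replica set members) and the same hypothetical `tweets` collection as above. Copying only insert operations also filters out the replicated TTL deletes just mentioned:

```python
import time
from pymongo import CursorType, MongoClient

source = MongoClient("mongodb://live-host/?replicaSet=rs0")
archive = MongoClient("mongodb://archive-host")["archive"]
oplog = source["local"]["oplog.rs"]

# Start from the newest entry; a real job would persist the last
# processed timestamp ("ts") so it can resume after a restart.
last = oplog.find().sort("$natural", -1).limit(1).next()["ts"]

while True:
    cursor = oplog.find(
        {"ts": {"$gt": last}, "op": "i", "ns": "production.tweets"},
        cursor_type=CursorType.TAILABLE_AWAIT,
    )
    for entry in cursor:
        last = entry["ts"]
        # "o" holds the inserted document; deletes and updates are
        # excluded by the "op" filter, so TTL deletes never propagate.
        archive["tweets"].insert_one(entry["o"])
    time.sleep(1)  # the cursor died (e.g. failover); re-query and resume
```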

Caveats:

  • Even though it might be less expensive, the new instance still has to be able to keep up with your operations - you do not want it to lag behind and end up missing data (the oplog is a capped collection, so entries that roll off before they are read are gone for good)
  • This is essentially rolling your own replication; it is not officially supported anywhere, so you have to support it yourself
  • You will need your own method to detect when this breaks (so you can stop deletes and recover from the failure), and your own method to restore data, etc.
  • Manual deletes or TTL-based deletes will eventually lead to fragmentation and inefficient reuse of space, so you will probably need to run compact and/or repair regularly to reclaim space (see below)
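
For the last point, compact can be issued per collection from pymongo; note that it blocks operations on the database it runs against on older storage engines, so it is typically run on one member at a time during a maintenance window:

```python
from pymongo import MongoClient

db = MongoClient("mongodb://live-host")["production"]

# Rewrites the collection and rebuilds its indexes to reclaim
# fragmented space left behind by manual or TTL deletes.
db.command("compact", "tweets")
```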

The above is usually too much effort/hassle and too much of an unknown from a supportability perspective unless you plan to run this way long term and are very comfortable supporting such a customized solution.

Your option 2 is often far easier, and there are ways to save on costs. You can, for example, have a hidden secondary with a different spec from the rest of the set to reduce costs. Imagine this kind of setup:

  • Primary, Secondary - both equivalent spec, with enough memory to hold the working set for the live site; generally expensive - use capped collections to manage data storage
  • Hidden Secondary - less expensive spec (less RAM, slower disk), never takes live traffic, keeps all data, perhaps with no indexes either - usually also a good source for snapshot backups
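
Hidden members are defined in the replica set configuration. A sketch of adding the inexpensive node via pymongo (hostnames are hypothetical; note that `buildIndexes` can only be set when a member is first added, and requires priority 0):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://primary-host/?replicaSet=rs0")

cfg = client.admin.command("replSetGetConfig")["config"]
cfg["members"].append({
    "_id": 3,
    "host": "cheap-host:27017",
    "hidden": True,          # invisible to clients, takes no live reads
    "priority": 0,           # can never be elected primary
    "buildIndexes": False,   # skip secondary indexes to save disk and RAM
})
cfg["version"] += 1          # a reconfig must bump the config version

client.admin.command("replSetReconfig", cfg)
```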

Getting the hidden secondary set up correctly will take some work. You need to configure it not to build indexes, and you need to pre-create any capped collections without the capped configuration so that they do not delete data (they will then simply keep growing). A sketch of that difference is below.
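
To illustrate the capped-versus-uncapped split in pymongo (the uncapped copy has to be created while the node is not yet replicating, e.g. while it is still running standalone before joining the set; names and sizes are hypothetical):

```python
from pymongo import MongoClient

# On the live primary: a capped collection that ages out old documents
# once it reaches the size limit.
live = MongoClient("mongodb://primary-host")["production"]
live.create_collection("tweets", capped=True, size=50 * 1024**3)

# On the future hidden member, started standalone before joining the
# set: same name, but uncapped, so replicated inserts accumulate forever.
hidden = MongoClient("mongodb://cheap-host")["production"]
hidden.create_collection("tweets")
```

This works because the deletes a capped collection performs when it wraps around are applied locally on each member rather than replicated through the oplog, so the uncapped copy keeps everything.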