Mongodb – collection partitioning in mongodb

mongodb

Is there any semantic like table partitioning in MongoDB? My data is not that much that it would need sharding. but data is not important(but not completely useless) after several months.

Best Answer

If you mean is there an archival type option built into MongoDB, the answer is: not yet.

You can take the manual approach, mongodump the data out, store it elsewhere and then delete it from your current data set for example. Or move the data to a different collection and ensure that the collection is as compact as possible (by running a repair for example).

At the moment though there is nothing to do this kind of operation automatically.

Related Solutions

Mongodb – Should I deploy Mongodb sharding for 50 collections

You could use a sharded cluster to distribute the collections more evenly across your available hosts. In general I would recommend not going beyond 5 shards (with at least 2 data bearing nodes for redundancy, plus an arbiter to break ties).

Once you have your 5 shards (or however many you end up with), you can then "pin" the collections to a particular shard with tagging. I could explain that process in detail, but as usual, Kristina Chodorow has beaten me to it:

http://www.kchodorow.com/blog/2012/07/25/controlling-collection-distribution/

Although it is not required, and you can have a single database if you wish, I would suggest having more than a single database in order to benefit as much as possible. Locking is at the database level, so concurrent writes will still be competing if you have more than one collection on a shard in the same database.

In fact, there is a simpler solution to this than tagging (though tags are the most scalable solution). If you create separate databases for all of your collections (or at least a subset), you can have each database reside on a different shard and not shard the collections at all.

To explain: without enabling sharded collections, all data for a given database will reside on the "primary" shard for that database (primary shards are designated in a round-robin fashion). Hence if you create 5 databases on a 5 shard cluster (for example) each will have a different shard for its primary and you will have achieved a crude (but effective) distribution of load by using separate databases. I added more detail regarding this in a previous answer.

Mongodb – Need a solution to store cheaply a big MongoDB collection

Short answer - this is a tricky proposition and your available options really depend on how much effort you wish to put in versus the cost of your option 2 (which I will talk about below). Having never done it, I can't speak to the ease (or lack thereof) of using another cloud based solution for storing the data elsewhere, but I can comment on some of the options on the MongoDB side.

First up, in terms of putting in the effort, a slight twist on your 2 proposals:

Create a less expensive instance in the cloud
Implement aging/removal on the live database (either manually, or with TTL or capped collections)
Use a tailable cursor to pull all operations out of your oplog and insert into your new instance(s). Alternatively you could look at the Mongo Connector for this functionality to avoid writing it yourself, but that will not work out of the box with TTL (TTL deletes are replicated from the primary and would need to be filtered out manually)

Caveats:

Even though it might be less expensive, it still has to be able to keep up with your operations - you do not want to have it lag behind and end up missing data
This is essentially doing a form of replication yourself, it's not officially supported anywhere - you have to support this yourself
You will need to come up with your own method to detect if this is broken so you can stop deletes, recover from failure, and your own method to restore data etc.
Manual deletes, or TTL based deletes will eventually lead to fragmentation and inefficient re-use of space, so you will probably need to regularly compact and/or repair to reclaim space

The above is usually too much effort/hassle and too much of an unknown from a supportability perspective unless you plan to run this way long term and are very comfortable supporting such a customized solution.

Your option 2, is often far easier and there are ways to save on costs. You can, for example, have a hidden secondary that is different from the rest to reduce costs. Imagine this kind of set up:

Primary, Secondary - both equivalent spec, enough memory to hold working set for live site, generally expensive - uses capped collections to manage data storage
Hidden Secondary - less expensive spec (RAM, slower disk), never takes live traffic, keeps all data, perhaps no indexes also - good source for snapshot backups too usually

To get the hidden secondary set up correctly will take some work. You need to configure it to not build indexes, you need to pre-create any Capped collections without the capped configuration so they do not delete data (they will then continue to grow).

Best Answer

Related Solutions

Mongodb – Should I deploy Mongodb sharding for 50 collections

Mongodb – Need a solution to store cheaply a big MongoDB collection

Related Question