MongoDB document delete and fragmentation

mongodb, nosql, mongo-repair

I have a MongoDB collection that works like a queue: new documents are inserted and old documents (after 60 days) are removed. I can see rapid growth of the data file size, too rapid. It may be reasonable, since we only remove data once it is 60 days old, but I was wondering: are my deletes effective without running a defragmentation? (In a few words: what's a good way to manage disk space in MongoDB?)

What is a correct defragmentation / collection cleanup policy? It's a production database and the version is 2.6.9.

Thanks.

Best Answer

Reasons for unexpected growth of the data files

"Data fragmentation" and data file preallocation

When a document is deleted, its space is reused right away if a new document fits into that space. Let's say you delete a document which takes 1 KB of disk space and a new document requiring 0.9 KB of disk space is synced to disk; then the first sufficiently large free slot (the deleted document's in our example) will be reused. Now let's assume the new document needs 1.1 KB. In a worst-case scenario, a new data file of 2 GB has to be preallocated although only 0.1 KB of space was missing. The reasons for data files being preallocated in large chunks are rather good ones, btw: allocating space on demand would simply take too long during a disk sync.
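
To see how much of that preallocated space your collection actually uses, you can compare the collection's data size with the space allocated on disk. A minimal sketch in the mongo shell, assuming a placeholder database mydb and a queue-like placeholder collection queue:

    // Switch to the database holding the collection.
    use mydb
    var s = db.queue.stats()

    // "size" is the space taken by the live documents (including padding),
    // "storageSize" is the space allocated to the collection on disk.
    // A storageSize much larger than size points to gaps left by deletes.
    print("documents + padding (bytes):     " + s.size)
    print("allocated to collection (bytes): " + s.storageSize)

    // db.stats().fileSize is the total size of the preallocated data files.
    print("preallocated data files (bytes): " + db.stats().fileSize)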

Padding

When a document is written, some extra space is added to allow the document to grow in size without triggering a rather expensive document migration each time. Documents are migrated when they no longer fit into their position in the data file, since

Documents are never fragmented

So if your documents grow, they have to be migrated and new padding is applied. Because a document must occupy a single contiguous slot, it might well be that millions of gaps in the data files together would provide enough space for a 1 KB document, yet a new data file still has to be preallocated.

Another "problem" is the way padding is calculated. As of MongoDB 2.6, documents are by default by using power of 2 sizes. So let's assume your document is 513 bytes in size. However, since the next power of 2 would be 1kb, almost half of the space allocated for the document would not be in use until it grows in size. So in a worst case scenario, half of the space allocated for your data files -1 byte might be "wasted".

Increased usage

Your application might well be gaining momentum, and there simply is more data stored than you expect. Congratulations!

What to do

Usually, one of three ways of dealing with data file growth is suggested.

  1. the compact command
  2. the repair command
  3. Forcing a resync from replica set members

I'll go over them with their Pros and Cons from my point of view and explain why I think all of them are improper ways of dealing with that data file growth.

The compact command

How it works

The compact command defragments the data files of a collection. It does so by preallocating a new 2 GB data file and moving the documents around until there are no gaps between them any more.
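
For completeness, this is roughly what the invocation looks like in the mongo shell, with mydb and queue as placeholder names; on the primary of a replica set the force option is required because the command blocks other activity:

    use mydb
    db.runCommand({ compact: "queue", force: true })

    // Optionally, new padding can be requested for the rewritten records,
    // e.g. 10% extra room per document:
    db.runCommand({ compact: "queue", paddingFactor: 1.1, force: true })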

Pros

The compact command is relatively fast when compared to the other solutions. The defragmentation helps a bit to prevent unnecessary data file preallocation.

Cons

  1. The database containing the target collection is locked during the execution.
  2. No disk space is reclaimed.
  3. You really should have a backup of the target collection before using the compact command. So in order to have said backup, you need to over provision your disks by 2 GB (the additional data file) plus the size of your largest collection (for the backup). But with over provisioned disks, space will not be a problem in the first place.
  4. It doesn't help at all when space really is a problem: if you are in a critical situation, the problems detailed above prevent you from using the compact command.

Why I don't think it is a proper solution

Well, it's kind of obvious - you lock your database, which means downtime. For really large databases, this means a lot of downtime, and all this for the relatively small gain of potentially preventing one or two data files from being created (which means 4 GB of disk space at most).

The repairDatabase command

How it works

Simplified, the repairDatabase command creates a second instance of your database, iterates over the documents in the original database, verifies them and writes them into the new database in consecutive order. In the last step, the old database is deleted and the new database is renamed.
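
A minimal sketch of the invocation, again with mydb as a placeholder database name; note that the command needs free disk space roughly equal to the size of the current data set plus some headroom:

    // Repair the current database via the shell helper ...
    use mydb
    db.repairDatabase()

    // ... or via the underlying command, optionally keeping the original
    // files around as a backup, which needs even more free disk space:
    db.runCommand({ repairDatabase: 1, backupOriginalFiles: true })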

Pros

With proper planning, you can reclaim disk space with very little downtime, since the repairDatabase command can be run against secondaries. So you can do the following:

  1. Run the repairDatabase command against all secondaries
  2. Have the primary step down. This might lead to 3-5 seconds of downtime during the election of the new primary.
  3. Run the repairDatabase command against the recently stepped-down former primary (see the sketch below)
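
A minimal sketch of steps 2 and 3, assuming you are connected to the current primary and that mydb is a placeholder; the argument to rs.stepDown() is the number of seconds the member will refuse to be re-elected:

    // Step 2: make the primary step down so an already-repaired secondary
    // takes over; expect a few seconds of downtime during the election.
    rs.stepDown(120)

    // Step 3: reconnect to the former primary (now a secondary) and repair it.
    use mydb
    db.repairDatabase()

    // Finally, verify that all members are healthy again.
    rs.status()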

Sounds nice, right? However, there is a huge

Cons

You need to massively over provision your disks, since basically a copy of your database is made. Even if you run this command against a database which is in an optimal state, you still need at least the same amount of free disk space as your database occupies at the moment you issue the repair command, just to make sure the command executes successfully. Since the repair command is potentially even more critical than the compact command, you should make a backup beforehand or use the backupOriginalFiles option.

Why I don't think it is a proper solution

The cons detailed above show that you have to provision at least twice the disk space of your payload data. With that massive amount of disk space, you would not have a problem in the first place.

Forcing a resync from replica set members

How it works

You shut down a secondary, delete its data files and restart it. The node notices that it is basically a new member added to the replica set and forces an initial sync. Since the initial sync is document oriented, only the necessary data files are allocated, potentially freeing formerly used disk space.
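
There is no special command for this: you stop the mongod, wipe its dbPath and start it again with the same replica set configuration. What you can do from the mongo shell is watch the initial sync; a sketch, run against any member:

    // The resyncing member shows up in state STARTUP2 while it copies data.
    rs.status().members.forEach(function (m) {
        print(m.name + "  " + m.stateStr);
    });

    // Once it has reached SECONDARY, check how far it lags behind the primary.
    rs.printSlaveReplicationInfo()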

As with the repair command, you do this for all secondaries (one after another, of course), then have the primary step down, delete its data files and let it resync.

Pros

  1. You do not need to over provision the disks of an individual node
  2. There is just very little downtime
  3. It is a relatively straightforward process

Cons

This process takes a while, may well have some impact on performance and reduces your planned level of redundancy. Let me explain this in a bit more detail: when planning a replica set, you choose how many replicas you want to have, ranging from one (two data bearing nodes plus an arbiter) to 50 at the time of this writing. You have a good reason for this redundancy, whatever it may be. When arbitrarily shutting down replica set members in order to reclaim disk space, you effectively reduce or even eliminate your failover capabilities. So it is safe to say that in order to keep your desired level of redundancy during the resync, you need one additional node to maintain it.

Why I don't think it is a proper solution

Put plainly: putting half the money you would spend on the additional node into additional disk space should solve any space problem in the first place. However, that might not be possible in your case (although, if so, it is probably because of underdimensioned hardware), and thus a resync might be a viable solution in some cases.

Ok, smarty pants: What to do?

Frankly, from my experience, the need to reclaim disk space is a sure sign of a badly planned cluster.

Granted, MongoDB is not the most efficient when it comes to disk space consumption, but after a while it levels out. So when MongoDB constantly adds new data files, you can be sure that you simply need more disk space.

This can be achieved through either vertical or horizontal scaling. If you can still scale vertically and get an adequate bang for your buck, your hardware was underprovisioned until now. Go for it, problem solved!

If you already get the most bang for the buck and the size of your data (not only the number of your data files) constantly grows, it is time to scale horizontally, read: to shard your cluster.

As a rule of thumb: when more than 80% of your disk space is used and the size of your data hasn't shown a massive spike but is growing constantly, I'd add a shard or start sharding. It requires some experience and knowledge to determine the exact threshold, and exactly how to do it is out of scope even for this long answer.
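
To make that a little more concrete, here is a rough sketch of the 80% check and of the basic sharding commands. The database name, collection name, provisioned disk size and the hashed _id shard key are all placeholder choices that need to be reviewed for your own workload:

    // Rough check of the rule of thumb for a single database; supply the
    // size of the volume your dbPath lives on yourself.
    use mydb
    var provisionedBytes = 500 * 1024 * 1024 * 1024;   // e.g. a 500 GB volume
    var usedBytes = db.stats().fileSize;               // preallocated data files
    print((100 * usedBytes / provisionedBytes).toFixed(1) + "% of the disk in use")

    // Basic sharding steps, run against a mongos of an existing sharded cluster:
    sh.enableSharding("mydb")
    db.queue.ensureIndex({ _id: "hashed" })
    sh.shardCollection("mydb.queue", { _id: "hashed" })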

With this approach, the decision of when to shard is based on empirical information, sharding is started early enough to prevent serious problems, maintenance effort and risk are reduced, and you can scale properly.

One last word: people often say that adding a shard is too expensive, or that they are not willing to pay for three config servers in addition to the data bearing nodes, and so they start to shard their data manually. The reason for that is plainly a wrong calculation of their own costs and a wrong understanding of how to do things sustainably. In the long run, reinventing the wheel is going to bite you in the neck.