MongoDB – How to predict new disk allocation

mongodb

I understand that MongoDB pre-allocates disk space, but the disk allocation I'm seeing is growing much faster than I'm anticipating.

For example, I had a database using up 8GB of actual disk space. Today a new 2GB file was pre-allocated, bringing it up to 10GB on disk. I didn't expect this, because it looked like my data (including wasted space) would comfortably fit.

Doing db.stats() I can see the dataSize for this database is ~2GB and the indexSize ~500MB. The storageSize is ~4GB, so even with a lot of fragmentation it's still nowhere near the disk space now allocated.
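
For reference, here is roughly what I'm looking at (output trimmed to the relevant fields, figures rounded):

    // from the mongo shell, with results scaled to MB
    db.stats(1024 * 1024)
    // dataSize:    ~2000
    // indexSize:   ~500
    // storageSize: ~4000
    // fileSize:    ~10000   (total size of the datafiles on disk)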

What I'm trying to understand is why the 4GB wasn't happily contained in the first 6GB of files. Why would data have been written to the last 2GB of empty space, causing a further 2GB to be allocated?

Using the output from db.stats(), is it possible to predict with any certainty what data size will result in new disk allocation?

For detail: I'm running MongoDB 2.6 as a replica set, not sharded. My database contains thousands of collections, but none of them is very large.

Best Answer

First of all, let us note that you use MMAPv1 as your storage engine, simply because it is the only available option in MongoDB 2.6.

MMAPv1 comes with a few intricacies.

Why does that much space get (pre-)allocated?

This is mainly due to two things: datafiles get preallocated (which is a requirement for the inner workings of MMAPv1), and datafiles get fragmented (though documents never do!). Let us have a look at the details of those two reasons.

Datafile preallocation

Let us start small. Let us assume we just started our MongoDB replica set. File allocation follows a rule:

There is always a pristine file preallocated, so that it is there when you need it.

The reason behind that is that under certain conditions, the datafile needs to be filled with zeroes to allocate the space, which takes quite some time. So, for performance, you want this work to be already done by the time you need the datafile.

But let us look a bit further. The first file to be allocated is 64MB, and this is where the local database is saved. As per the rule that there is always a pristine datafile, a new one gets allocated with 128MB (each new file doubles the size of the previous one until 2GB is reached).

So we now have 192MB allocated, although we may well have only a few KB of data saved so far. Now, let us assume that you store 4 documents with the maximum size of 16MB each. The first three would fit into the 64MB datafile, but for the last one there would be a few KB too little space.

The last document will be written in its entirety to the pristine datafile, since

Documents are guaranteed to never be fragmented.

Now that our former pristine datafile holds data, a new datafile of 256MB gets allocated. The total size of our datafiles is now 256+128+64 = 448MB, although only 64MB plus a few KB are actually saved (setting aside the oplog for the moment).
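
To make the arithmetic easy to follow, here is a tiny sketch of the doubling rule in plain shell JavaScript (an illustration only, not how mongod implements it):

    // each new datafile doubles the size of the previous one, capped at 2GB
    var sizeMB = 64, files = [];
    while (files.length < 6) {
        files.push(sizeMB);
        sizeMB = Math.min(sizeMB * 2, 2048);
    }
    // files: [64, 128, 256, 512, 1024, 2048]
    // after the third file: 64 + 128 + 256 = 448MB allocated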

Datafile fragmentation

While documents are guaranteed to never be fragmented, the datafiles almost always are.

Now, let us assume you have a running application, you create a lot of documents, you delete a lot of documents, all with an average size of, let's say, 10MB.

When a document is deleted, the datafiles do not get compacted.

The space of the deleted document only gets marked as free, pretty much as with filesystems.

Now, you want to add a document with the maximum size of 16MB. In a worst-case scenario, although there is plenty of space available in total, mongod cannot find a spot in the existing datafiles large enough to accommodate said document, so it writes it to the pristine datafile and allocates a new one. This behavior alone can leave you with almost 4GB of datafiles (almost 2GB of the old preallocated file plus the newly allocated 2GB file) holding only a single document. In your case, that would leave 40% of the total size of the datafiles unused.

When a lot of deletions happen after this, you can have several GB of allocated datafiles which are unused.

It has to be said that while the proportion of unused space is quite big in your case, this levels out for bigger data sets. Let me put it this way: you are on the lower, less space-efficient end of the scale.
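
If you want to keep an eye on this, the relevant figures are all in db.stats(). A minimal sketch, run in the mongo shell against the database in question (the "unused" line is only a rough measure):

    // all figures scaled to MB
    var s = db.stats(1024 * 1024);
    print("dataSize:    " + Math.round(s.dataSize));     // size of the documents themselves
    print("storageSize: " + Math.round(s.storageSize));  // space allocated to extents, incl. freed record space
    print("fileSize:    " + Math.round(s.fileSize));     // total size of the datafiles on disk
    print("unused:      " + Math.round(s.fileSize - s.storageSize - s.indexSize));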

What can I do against it?

Note: use the procedures described here at your own risk. Make sure you understand what you are trying to do before you do it.

To be honest with you: not too much when it comes to preallocation. With 4GB of data, you will (almost) always have a 2GB datafile preallocated, resulting in roughly 30% of unused, but preallocated and necessary, space (mongod will cause problems if it cannot preallocate a datafile).

To remedy the problem of datafile fragmentation, you can use the compact command, which does more or less exactly what you'd expect from it. Make sure you read and understand its documentation before using it.
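
A minimal invocation looks like this (the collection name is a placeholder; keep in mind that on MMAPv1, compact blocks operations on the database while it runs):

    // run per collection, ideally on one member at a time, secondaries first
    db.runCommand({ compact: "myCollection" })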

Another option to reduce datafile fragmentation is to shut down a secondary, delete the contents of its dbpath and start it again. It will then perform a resync of the data from the primary, writing the data as a contiguous stream into the datafiles. After the sync is finished, repeat the process for the second secondary (if there is one). Once that is done, have the primary step down and repeat the process on the machine of the former primary. Keep in mind, however, that this temporarily reduces or (in the case of a 3-member replica set with an arbiter) eliminates the redundancy.
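
Only the step-down part of that cycle happens in the mongo shell; stopping the process and wiping the dbpath are OS-level steps that depend on your installation, so they are only sketched in comments here:

    // 1. stop mongod on the member
    // 2. delete the contents of its dbpath (double-check the path first!)
    // 3. start mongod again -- it rejoins the set and performs an initial sync
    // when it is the former primary's turn, connect to the primary and run:
    rs.stepDown(120)   // step down and do not seek re-election for 120 seconds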

Some people suggest using the repair command to reclaim disk space. While this is possible, I personally strongly advise against it (the invocation itself is shown after the following list). It comes with some serious drawbacks:

  1. It needs the size of your current datafiles x 2, plus 2GB, of space to be executed. If you have that much spare disk space, the size of your datafiles most likely is not a problem in the first place.
  2. Unless you are absolutely, positively sure that your datafiles are in a sane state (and I have yet to find a way to verify that), the repair command may leave your datafiles in an undefined state. It guarantees that your instance will still be usable, but that is pretty much it.
  3. The command can only be run against a primary or a standalone, which effectively means that you have to shut down a secondary, start it as a standalone, run the repair command, wait for the repair to complete and then restart the instance as part of the replica set to regain redundancy. This is quite different from the resync method described above, where the redundancy is restored as soon as the data is resynced. Well, and of course the repair command will finish at about 3 in the morning, an hour after the primary blew up ;)
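
For completeness, this is the invocation in question, to be run while connected to the database on a standalone or primary (again: read the caveats above first):

    // rewrites all datafiles of the current database; blocks while running
    db.repairDatabase()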

So what?

Imho, leave it as it is. While the proportion of unused space looks horrible at the moment, it will get better over time, and the growth of disk usage in relation to the data size will slow down drastically. However, as always: keep an eye on it, since your use cases may differ from the (more or less) standard behavior described here.

hth