There are a couple of separate points here, but I don't think how MongoDB stores data in RAM is really relevant: MongoDB just uses the mmap() call and lets the kernel take care of memory management (by default the Linux kernel uses a Least Recently Used (LRU) policy to decide what to page out and what to keep; there are more specifics to that, but they are not terribly relevant here).
In terms of your issues, it sounds like you may have had a corrupt index, though the evidence is somewhat circumstantial. Now that you have run a repair (the validate() command would have confirmed or denied this beforehand), there won't be any evidence left in the current data, but you may find more in the logs, particularly from when you were attempting to recreate the index or use it in queries.
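For future reference, validate can be run from the shell before resorting to repair; a quick sketch (foo is a placeholder collection name):

    // Passing true requests a full validation, which scans all documents
    // and index entries; it locks the collection, so avoid peak hours.
    db.foo.validate(true)
    // Inspect the "valid" and "errors" fields of the result document.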
As for the spikes in the page faults, btree stats, journal, lock percentage, and average flush time, that has all the hallmarks of a bulk delete: it causes a lot of index updates and a large amount of IO. The fact that mapped memory drops off later in the graphs suggests that the repair significantly reduced the storage size, which usually indicates significant fragmentation (bulk deletes, along with updates that grow documents, are the leading causes of fragmentation).
Therefore, I would look for a large delete operation logged as slow in the logs - it will only be logged once complete, so look for it to appear after the end of the events in MMS. One of the quirks of not running in a replica set is that a bulk operation like this is relatively non-obvious - it shows up as a single delete operation in the MMS graphs (usually lost in the noise).
These bulk delete operations usually tend to target older data that has not been recently used and has hence been paged out of active memory by the kernel (LRU again). To delete it, all of that data must be paged back in and the changes then flushed to disk, and of course deletes require the write lock - hence the spikes in faults, lock percentage, etc.
To make room for that paged-in data, your current working set gets paged out, which will hurt your normal usage until the deletes complete and the memory pressure eases.
FYI: when you run a replica set, bulk ops are serialized in the oplog and hence replicated one at a time, so you can track such operations by their footprint in the replicated-ops stats of the secondaries. This is not possible with a standalone instance, short of looking in the logs for the completed ops and other secondary indications.
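If you want a more direct record on a standalone instance next time, the database profiler can capture slow operations as they complete (the 100ms threshold below is just an example):

    // Level 1 records operations slower than the threshold (in ms) into
    // the capped system.profile collection of the current database.
    db.setProfilingLevel(1, 100)
    // Later, look for the bulk delete among the slowest entries:
    db.system.profile.find({ op: "remove" }).sort({ millis: -1 }).limit(5)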
As for managing large deletes in the future, it is generally far more efficient to partition your data into separate databases (if possible) and then drop old data by simply dropping the old databases when they are no longer needed. This requires some extra management on the application side, but it negates the need for bulk deletes, completes far more quickly, limits fragmentation, and dropping a database also removes its files on disk, preventing excessive storage use. Definitely recommended if your use case allows it.
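As a rough sketch of that pattern, assuming a hypothetical one-database-per-month naming scheme (the names and collection are illustrative):

    // Writes always go to the database for the current period:
    var current = db.getSiblingDB("logs_2013_07");
    current.events.insert({ ts: new Date(), level: "info" });

    // Retiring a month of data is then a single, fast operation that
    // also removes the underlying datafiles from disk:
    db.getSiblingDB("logs_2013_06").dropDatabase();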
The delay you are seeing when loading from a snapshot is not caused by how indexes are laid out on disk. Far more likely, you are seeing it because when you start an instance from a snapshot, data is loaded only on first access, which is significantly slower than subsequent accesses. That is a basic limitation of using snapshots this way and has little to do with the application trying to access the disk - it is why you will see guides on "how to warm up an EBS volume" and the like (there are penalties on first writes too). If you warm up the disk with another tool (dd, for example) and the performance issue goes away, you have pretty decent proof that the layout of the data has nothing to do with the issue.
Along those lines, MongoDB has the touch command, which lets you warm up the data before you use it in anger (you can touch just the data, just the indexes, or both). The first access after you attach the volume will still be slow, so touch itself will take a while, but after that warm-up phase your results should be reasonably consistent.
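A minimal example of warming up a single collection this way (assuming a collection named foo):

    // Page the collection's documents and its index entries into RAM
    // before the instance takes real traffic.
    db.runCommand({ touch: "foo", data: true, index: true })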
In terms of how things are stored on disk, you have the basics correct regarding file allocation, but there is a logical structure within the files, called extents, and these are the real units of storage. That and far more is covered in detail in this presentation by Mathias Stearn, one of the kernel developers at MongoDB.
Indexes are just another (structured) form of data in MongoDB, stored in linked extents throughout the files. Fragmentation can become an issue (that is what the compact command is for), as can disk space usage (repair reclaims it), but you haven't described a workload that would immediately suggest a fragmentation issue, which is why I suspect something else (like the first-use penalty) is your root cause.
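If you do want to check for fragmentation, comparing document bytes to allocated bytes is a reasonable first look; a sketch, with foo again a placeholder name:

    // Under MMAPv1, "size" is the bytes used by the documents themselves,
    // while "storageSize" is the space allocated to the collection; a
    // large gap after heavy deletes points towards fragmentation.
    var s = db.foo.stats();
    print("data bytes: " + s.size + ", allocated bytes: " + s.storageSize);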
Best Answer
First of all, let us note that you are using MMAPv1 as your storage engine, simply because it is the only option available in MongoDB 2.6.
MMAPv1 comes with a few intricacies.
Why does so much space get (pre-)allocated?
This is mainly due to two reasons: datafiles get preallocated (a requirement for the inner workings of MMAPv1), and datafiles become fragmented (though documents never do!). Let us have a look at the details of those two reasons.
Datafile preallocation
Let us start small and assume we just started our MongoDB replica set. File allocation follows one rule:

There is always one pristine (preallocated, completely empty) datafile kept ready.

The reason behind that rule is that under certain conditions a datafile needs to be filled with zeroes to allocate its space, which takes quite some time. For performance, you want that work to be already finished by the time you need the datafile.
But let us look a bit further. The first file to be allocated is 64MB in size, and this is where the local database is saved. As per the rule that there is always a pristine datafile, a new one gets allocated at 128MB (the size of each new file doubles until 2GB is reached). So we now have 192MB allocated, although we may only have saved a few kB so far. Now, let us assume you store 4 documents with the maximum size of 16MB. The first three would fit into the 64MB datafile, but for the last one there would be a few kB too little space. Since documents are never split across datafiles, the last document gets written in its entirety to the pristine datafile. Now that our former pristine datafile holds data, a new datafile of 256MB gets allocated. The total size of our datafiles is now 64+128+256 = 448MB, although only 64MB plus a few kB are actually saved (setting aside the oplog for the moment).
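You can watch this effect on a live instance: db.stats() reports both the bytes actually used and the bytes allocated on disk. A small sketch:

    // Under MMAPv1: dataSize = bytes used by documents, storageSize =
    // space allocated to collections (including freed extents), and
    // fileSize = total size of the datafiles, preallocation included.
    var s = db.stats();
    print("dataSize: " + s.dataSize + ", fileSize: " + s.fileSize);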
Datafile fragmentation
While documents are guaranteed to never be fragmented, the datafiles almost always are.
Now, let us assume you have a running application: you create a lot of documents and you delete a lot of documents, all with an average size of, let's say, 10MB. The space of a deleted document only gets marked as free, pretty much as with filesystems.
Now, you want to add a document with the maximum size of 16MB. In the worst case, although plenty of space is available in total, mongod cannot find a place in the existing datafiles where it can accommodate said document, so it writes it to the pristine datafile and allocates a new one. This behavior alone can leave you with almost 4GB of datafiles (almost 2GB of old preallocated files plus a newly allocated 2GB file) holding only a single document. In your case, that would mean 40% of the total size of the datafiles unused. When a lot of deletions happen after this, you can end up with several GB of allocated datafiles which are unused.
It has to be said that while the share of unused space is quite big in your case, this levels out for bigger data sets. Let me put it this way: you are on the lower, less space-efficient end of the scale.
What can I do against it?
To be honest with you: not too much when it comes to preallocation. With 4GB of data, you will (almost) always have a 2GB datafile preallocated, resulting in roughly 30% of unused but preallocated and necessary space (mongod will cause problems if it cannot preallocate a datafile).

To remedy datafile fragmentation, you can use the compact command, which does more or less exactly what you would expect it to. Make sure you read and understand its documentation before using it.

Another option to reduce datafile fragmentation is to shut down a secondary, delete the contents of its dbpath, and start it again. It will then resync the data from the primary, writing it as a contiguous stream into the datafiles. After the sync is finished, repeat the process for the second secondary (if one exists). Once that is done, have the primary step down and repeat the process on the machine of the former primary. Keep in mind, however, that this reduces or (in the case of a 3-member replica set with an arbiter) temporarily eliminates your redundancy.

Some people suggest using the repair command to reclaim disk space. While that is possible, I personally strongly advise against it. It comes with some serious drawbacks, chief among them that it needs free disk space roughly equal to the size of your current data set plus 2GB, and that the instance (or the member, in a replica set) is unavailable for the duration of the repair.
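For reference, compact is run per collection (foo is a placeholder below); on 2.6/MMAPv1 it blocks all operations on the database while it runs, and it defragments within the existing files rather than returning space to the operating system:

    // Defragment one collection's documents and rebuild its indexes.
    // Run it on secondaries taken out of rotation first, then step the
    // primary down and compact there as well.
    db.runCommand({ compact: "foo" })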
So what?
Imho, leave it as it is. While the share of unused space looks horrible at the moment, it will level out over time, and the growth of disk usage relative to your data size will slow down drastically. However, as always: keep your eyes on it, since your use case may differ from the (more or less) standard behavior described here.
hth