There is currently no built-in way to do this, so a small function is needed. For the purposes of this answer I created a 2-shard cluster with ~1 million documents as per these instructions. I then used this function to examine those documents:
AllChunkInfo = function(ns, est) {
    // all chunks for the namespace, ordered by min
    var chunks = db.getSiblingDB("config").chunks.find({"ns": ns}).sort({min: 1});
    // counters for the overall stats printed at the end
    var totalChunks = 0;
    var totalSize = 0;
    var totalEmpty = 0;
    print("ChunkID,Shard,ChunkSize,ObjectsInChunk"); // header row
    // iterate over all the chunks, print out info for each
    chunks.forEach(
        function printChunkInfo(chunk) {
            // the database we will run the dataSize command against
            var db1 = db.getSiblingDB(chunk.ns.split(".")[0]);
            // the shard key pattern, needed for the dataSize call
            var key = db.getSiblingDB("config").collections.findOne({_id: chunk.ns}).key;
            // dataSize returns the info we need on the data; the estimate option
            // uses counts and average object size, which is far less intensive
            var dataSizeResult = db1.runCommand({datasize: chunk.ns, keyPattern: key, min: chunk.min, max: chunk.max, estimate: est});
            // printjson(dataSizeResult); // uncomment to see how long it takes to run and status
            print(chunk._id + "," + chunk.shard + "," + dataSizeResult.size + "," + dataSizeResult.numObjects);
            totalSize += dataSizeResult.size;
            totalChunks++;
            if (dataSizeResult.size == 0) { totalEmpty++; } // count empty chunks for the summary
        }
    );
    print("***********Summary Chunk Information***********");
    print("Total Chunks: " + totalChunks);
    print("Average Chunk Size (bytes): " + (totalSize / totalChunks));
    print("Empty Chunks: " + totalEmpty);
    print("Average Chunk Size (non-empty): " + (totalSize / (totalChunks - totalEmpty)));
};
It's pretty basic at the moment, but it does the job. I have also added it on github and may expand it further there. For now though, it will do what is needed. On the test data set described at the start, the output looks like this (some data removed for brevity):
mongos> AllChunkInfo("chunkTest.foo", true);
ChunkID,Shard,ChunkSize,ObjectsInChunk
chunkTest.foo-_id_MinKey,shard0000,0,0
chunkTest.foo-_id_0.0,shard0000,599592,10707
chunkTest.foo-_id_10707.0,shard0000,1147832,20497
chunkTest.foo-_id_31204.0,shard0000,771568,13778
chunkTest.foo-_id_44982.0,shard0000,771624,13779
// omitted some data for brevity
chunkTest.foo-_id_940816.0,shard0000,1134224,20254
chunkTest.foo-_id_961070.0,shard0000,1145032,20447
chunkTest.foo-_id_981517.0,shard0000,1035104,18484
***********Summary Chunk Information***********
Total Chunks: 41
Average Chunk Size (bytes): 1365855.024390244
Empty Chunks: 1
Average Chunk Size (non-empty): 1400001.4
To explain the arguments passed to the function:
The first argument is the namespace to examine (a string), and the second (a boolean) controls whether the estimate option is used. For any production environment I recommend estimate:true - without it, all of the data has to be examined, which means pulling it into memory, and that is expensive. While the estimate:true version is not free (it uses counts and average object sizes), it is at least reasonable to run even on a large data set. The estimate can be a little off if object sizes are skewed on some shards, making the average size unrepresentative, but that is generally pretty rare.
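For reference, the underlying dataSize command can also be run by hand from the mongo shell. The namespace, key pattern, and chunk bounds below are illustrative, taken from the sample output above; run this against mongos on your own cluster with your own values:

```javascript
// Run from the mongo shell against mongos; values here are illustrative
db.getSiblingDB("chunkTest").runCommand({
    dataSize: "chunkTest.foo",   // namespace to measure
    keyPattern: { _id: 1 },      // shard key pattern for the bounds
    min: { _id: 0 },             // chunk lower bound
    max: { _id: 10707 },         // chunk upper bound
    estimate: true               // use count * avgObjSize rather than scanning
})
```

With estimate:false (the default) the same command scans every document in the range, which is where the memory cost described above comes from.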
The array contains 348 empty documents to begin with, and over the course of a week these array elements will have sub-documents inserted (if empty) and then subsequently updated (if they already exist). The sub-documents are approximately 100 bytes in size and are not indexed.
One consideration with this use case is that your documents are consistently growing. MongoDB uses a record allocation (or padding) strategy to allow documents to grow in place. For example, if your document starts off at 1,000 bytes, MongoDB 2.6 or newer will round this up to a 1,024 byte record allocation for MMAP (as per the default Power of 2 Sizes strategy). Updates that don't grow the document beyond its current record allocation are more efficient for the server to execute.
However, if you added 100 bytes to a document which was initially 1000 bytes, the document would have to be moved to a new record allocation in storage (and associated index entries would also have to be updated). So in this example, the next allocation for a 1100 byte document would be 2048 bytes (allowing for ~9 more 100 byte fields to be added before a new record allocation was needed for this document). Indexes in MongoDB include the storage location of the document, so a document move will result in an update for every index entry referencing that document.
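The allocation arithmetic above can be sketched in plain JavaScript. This is a toy model of the power of 2 strategy for illustration, not MongoDB's actual allocator:

```javascript
// Toy model of power-of-2 record allocation: round the document size
// up to the next power-of-2 bucket (minimum bucket size assumed here).
function powerOf2Allocation(docSizeBytes) {
    var alloc = 32; // assume a small minimum bucket for the sketch
    while (alloc < docSizeBytes) {
        alloc *= 2;
    }
    return alloc;
}

console.log(powerOf2Allocation(1000)); // 1024 - the example above
console.log(powerOf2Allocation(1100)); // 2048 - after growing past 1024
```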
You can check the frequency of document moves by looking at the nmoved value for slow updates (or by enabling increased levels of logging / system profiling). Frequent document moves can definitely have a performance impact. Common strategies include either reconsidering the data model (eg. moving the growing portion of the document to a separate collection, if appropriate) or adding manual padding to the initial document allocation. The default power of 2 allocation strategy is designed to avoid the need for manual padding in most cases, but if your documents start small and grow quickly you might be able to avoid some initial document moves by padding them up front.
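To make the trade-off concrete, here is a small simulation (again a toy model, assuming the power-of-2 buckets described above) counting how many record moves a steadily growing document would incur, with and without manual padding:

```javascript
// Toy simulation: count record moves for a document growing in fixed
// increments under an assumed power-of-2 record allocation model.
function countMoves(initialBytes, growBy, steps) {
    var alloc = 32;
    while (alloc < initialBytes) alloc *= 2; // initial allocation bucket
    var size = initialBytes, moves = 0;
    for (var i = 0; i < steps; i++) {
        size += growBy;
        if (size > alloc) {                  // no longer fits: document move
            while (alloc < size) alloc *= 2; // new, larger allocation
            moves++;
        }
    }
    return moves;
}

// a 1000 byte document growing by 100 bytes, 20 times:
console.log(countMoves(1000, 100, 20)); // 2 moves (at 1100 and 2100 bytes)
// padded up front to 3000 bytes (a 4096 byte allocation): no moves
console.log(countMoves(3000, 100, 10)); // 0
```

In this model the padded document absorbs all of its growth within the initial allocation, which is the effect manual padding is aiming for.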
So my question is, when the document is flushed to disk, what actually gets written, the entire document or just the sub-document that has been updated?
The answer will depend on the size of your document and the nature of updates since the last background flush. I'll assume you are using a default configuration with MMAP storage engine and journal enabled.
By default data changes are written twice: once to fast append-only journal files (committed to disk every 100ms) and again to a private view in memory (flushed to data files every 60s). The background flush process is a periodic asynchronous write of all pages that have been "dirtied" in memory since the last flush. Journal commit and background flush intervals can be influenced by both server configuration and write concerns. For a good overview of the process see How MongoDB’s Journaling Works.
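Both intervals are tunable. For example, in a YAML mongod configuration file (a sketch showing the defaults; the option names are from the 2.6+/3.0 YAML format):

```yaml
# illustrative mongod configuration fragment
storage:
  syncPeriodSecs: 60        # background flush interval (default 60s)
  journal:
    enabled: true
    commitIntervalMs: 100   # journal commit interval (default 100ms for MMAP)
```

Lowering these intervals trades more frequent I/O for a smaller window of unflushed changes.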
The MMAP storage engine will fetch the full document into memory before applying updates. The standard x86 page size is 4KiB so a single document may be represented by one or more pages -- or multiple documents may be part of a single page in memory.
So, if you are updating a single document the writes will include:
- all changes written to the journal
- all changes written to the oplog (if that node is part of replica set)
- any pages dirtied for that document since the last background flush
An important caveat is "since the last background flush". Multiple updates affecting the same pages within a given sync interval will effectively be batched.
If you're trying to get to the bottom of performance issues, then consistently high background flush times (particularly as a large or increasing percentage of the default 60s flush interval) are definitely of concern, but should be reviewed in the context of other metrics such as page faults, I/O stats, and lock percentage. I would also review the MongoDB Production Notes for general tips, and upgrade to the latest MongoDB production release for your major version (i.e. the latest 2.6.x or 3.0.x if there's a newer x than your current version).
Best Answer
Left to its own devices, no, MongoDB will not move those unsharded databases to a different primary shard - the automatic balancing only applies to chunks from sharded collections.
It will round-robin through your shards as databases are created, which spreads them out across all the shards from that perspective. If you had one shard originally and later expanded to many, the databases may have ended up concentrated on that original shard - the round-robin placement only applies when a database is created, not to the collections inside it.
Once the databases are created, and assuming you can predict what will be used and when, you can then move them to whatever shard you wish using the movePrimary command and distribute load accordingly:
http://www.mongodb.org/display/DOCS/movePrimary+Command
Naturally, this will be a quicker process if there is no data in the databases, but should still be possible after the fact.
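For example, from the mongo shell connected to mongos (the database and shard names here are illustrative):

```javascript
// Run against mongos; "myUnshardedDB" and "shard0001" are illustrative names
db.adminCommand({ movePrimary: "myUnshardedDB", to: "shard0001" })

// verify the new primary shard afterwards:
db.getSiblingDB("config").databases.findOne({ _id: "myUnshardedDB" })
```

Note that movePrimary copies the unsharded data to the new shard, so on a database with significant data it should be run during a maintenance window.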