Mongodb – if compact is required after move chunk

mongodb

I have mongodb (wired tiger) with two shards.
I added third shard and enabled the balancer.

I can see the logs where the data chunks are getting moved to the new shard.
After the movement of chunk where the disk space of old shards( from which data is moved) will be returned to the OS or compact command has to be executed.

Best Answer

As at MongoDB 4.0, chunk migration for a sharded collection does not automatically run compact or free disk space when complete. However, storage used by documents that have been migrated is marked as available for reuse for new data insertion/updates. An underlying assumption is that you will continue to add data to shards over time, so disk space does not need to be aggressively freed.

If you do want to free up disk space after significant changes in storage usage (for example, removing or migrating a large number of documents) you can manually compact collections.

It is important to note that the compact command currently blocks operations for the database being compacted, so is not advisible to run on a production system without careful consideration of impact. One approach would be to compact each of your original shards using rolling maintenance (compacting one secondary at a time). If you have a significant number of collections to compact, you could also re-sync secondaries (one at at time) to rebuild all data files without downtime.

Related Solutions

MongoDB Sharding – How to Determine Chunk Distribution in a Sharded Cluster

There is currently no built-in way to do this, so a small function is needed. For the purposes of this answer I have created a 2 shard cluster with ~1 million documents as per these instructions. Next up I used this function to examine those documents:

AllChunkInfo = function(ns, est){
    var chunks = db.getSiblingDB("config").chunks.find({"ns" : ns}).sort({min:1}); //this will return all chunks for the ns ordered by min
    //some counters for overall stats at the end
    var totalChunks = 0;
    var totalSize = 0;
    var totalEmpty = 0;
    print("ChunkID,Shard,ChunkSize,ObjectsInChunk"); // header row
    // iterate over all the chunks, print out info for each 
    chunks.forEach( 
        function printChunkInfo(chunk) { 

        var db1 = db.getSiblingDB(chunk.ns.split(".")[0]); // get the database we will be running the command against later
        var key = db.getSiblingDB("config").collections.findOne({_id:chunk.ns}).key; // will need this for the dataSize call
        // dataSize returns the info we need on the data, but using the estimate option to use counts is less intensive
        var dataSizeResult = db1.runCommand({datasize:chunk.ns, keyPattern:key, min:chunk.min, max:chunk.max, estimate:est});
        // printjson(dataSizeResult); // uncomment to see how long it takes to run and status           
        print(chunk._id+","+chunk.shard+","+dataSizeResult.size+","+dataSizeResult.numObjects); 
        totalSize += dataSizeResult.size;
        totalChunks++;
        if (dataSizeResult.size == 0) { totalEmpty++ }; //count empty chunks for summary
        }
    )
    print("***********Summary Chunk Information***********");
    print("Total Chunks: "+totalChunks);
    print("Average Chunk Size (bytes): "+(totalSize/totalChunks));
    print("Empty Chunks: "+totalEmpty);
    print("Average Chunk Size (non-empty): "+(totalSize/(totalChunks-totalEmpty)));
}

It's pretty basic at the moment, but it does the job. I have also added it on github and may expand it further there. For now though, it will do what is needed. On the test data set described at the start, the output looks like this (some data removed for brevity):

mongos> AllChunkInfo("chunkTest.foo", true);
ChunkID,Shard,ChunkSize,ObjectsInChunk
chunkTest.foo-_id_MinKey,shard0000,0,0
chunkTest.foo-_id_0.0,shard0000,599592,10707
chunkTest.foo-_id_10707.0,shard0000,1147832,20497
chunkTest.foo-_id_31204.0,shard0000,771568,13778
chunkTest.foo-_id_44982.0,shard0000,771624,13779
// omitted some data for brevity
chunkTest.foo-_id_940816.0,shard0000,1134224,20254
chunkTest.foo-_id_961070.0,shard0000,1145032,20447
chunkTest.foo-_id_981517.0,shard0000,1035104,18484
***********Summary Chunk Information***********
Total Chunks: 41
Average Chunk Size (bytes): 1365855.024390244
Empty Chunks: 1
Average Chunk Size (non-empty): 1400001.4

To explain the arguments passed to the function:

The first argument is the namespace to examine (a string), and the second (a boolean) is whether or not to use the estimate option or not. For any production environment it is recommended to use estimate:true - if it is not used, the all the data will need to be examined, and that means pulling it into memory, which will be expensive.

While the estimate:true version is not free (it uses counts and average object sizes), it is at least reasonable to run even on a large data set. The estimate version can also be a little off if object size is skewed on some shards and hence the average size is not representative (this is generally pretty rare).

MongoDB Balancer:move data inside replication set or across replication set

1 - Yes, the balancing operation can affect the performance, that's why you can schedule the balancing window. By default, the balancer is enabled and may run at any time.

2 - The balancer moves data chunks between shards.

Just to clarify, each shard is a replica set. All members of the replica set (the primary and N secondaries) have the same data (replicated). Therefore, each shard has a different subset of the total data.

Best Answer

Related Solutions

MongoDB Sharding – How to Determine Chunk Distribution in a Sharded Cluster

MongoDB Balancer:move data inside replication set or across replication set

Related Question