Mongodb – Improve the insertion rate of mongo sharded cluster

mongodbnosqlperformance

I can't disable the write concern while i mongorestore with .bson file. It totally damages the insertion rate (insert/sec).

This is my sharding status.

mongos> sh.status()
--- Sharding Status --- 
sharding version: {
"_id" : 1,
"version" : 4,
"minCompatibleVersion" : 4,
"currentVersion" : 5,
"clusterId" : ObjectId("546bd498f0531e27d40544b2")
} shards: { "_id" : "shard2", "host" : "shard2/hadoop5.xx.com:27018" } databases: { "_id" : "admin", "partitioned" : false, "primary" : "config" } { "_id" : "deneme6", "partitioned" : false, "primary" : "shard2" } { "_id" : "seref", "partitioned" : true, "primary" : "shard2" }

This is my shard. I already changed the property of getLastErrorDefaults w:0 for disable write concern.

shard2:PRIMARY> show tables; me oplog.rs startup_log system.indexes system.replset          shard2:PRIMARY> db.me.find() { "_id" : ObjectId("546bd2869e14c01f13708886"), "host" : "hadoop5.xx.com" } 
shard2:PRIMARY> db.system.replset.find() { "_id" : "shard2", "version" : 2, "members" : [ { "_id" : 0, "host" : "hadoop5.xx.com:27018" } ], "settings" : { "getLastErrorDefaults" : { "w" : 0 } } } 
shard2:PRIMARY>

This is my mongorestore command which load data via query router ( mongos )

mongorestore --db seref --collection can11 --noobjcheck --w 0 --noIndexRestore --host       10.21.12.21 --port 27017 /tmp/iabc_kwd.bson

When i try to mongorestore on sharded cluster via mongos the insertion rate is 300k byte / sec , when i try to mongorestore on single mongodb the insertion rate is 12m /sec.

How can i speed up the insertion rate of restoring mongodb sharded cluster?


-- Sharding Status --- 
  sharding version: {
    "_id" : 1,
    "version" : 3,
    "minCompatibleVersion" : 3,
    "currentVersion" : 4,
    "clusterId" : ObjectId("546ca29960e36f8983130015")
}
  shards:
    {  "_id" : "shard0000",  "host" : "hadoop5:27018" }
    {  "_id" : "shard0001",  "host" : "hadoop6:27017" }
    {  "_id" : "shard0002",  "host" : "hadoop4:27017" }
  databases:
    {  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
    {  "_id" : "test",  "partitioned" : false,  "primary" : "shard0000" }
    {  "_id" : "seref",  "partitioned" : true,  "primary" : "shard0000" }
    {  "_id" : "abc",  "partitioned" : false,  "primary" : "shard0001" }
    {  "_id" : "mahmut",  "partitioned" : true,  "primary" : "shard0002" }
    {  "_id" : "incee",  "partitioned" : true,  "primary" : "shard0002" }
        incee.onur
            shard key: { "_id" : "hashed" }
            chunks:
                shard0000   1
                shard0002   501
            too many chunks to print, use verbose if you want to force print

All shards computer have 8 GB RAM , and they have 1.3 TB HDD.

Best Answer

The problem (from what I can see)

From what I can see from your output, you have a monotonically increasing shard key like ObjectId or a Date or something similar.

MongoDB sharding is done with key ranges over the selected shard key. Put simply, it works like "shard0000 is supposed to hold the rang from -infinity to $i, shard0001 the keys from $h to $p and shard0002 from $q to infinity. With a monotonically increasing shard key, every new value will be bigger than the last one inserted, heading to infinity, so the shard which will be used is always shard0002, instead of the writes being balanced to all shards.

So all of the write operations seem to hit shard0002 with individual chunks being migrated to the other shards. The necessary chunk migrations together with added network latency and a lot of metadata updates going on, it may well be that you have a poor write performance.

The solution

Really bad news: you can't change the shard key once it is set. You basically have to create new sharded collection with a better shard key. For choosing a proper shard key, please read Considerations for Selecting Shard Keys in the MongoDB docs.

P.S. I wasn't sure wether to answer this question, since it is slightly off topic on Stackoverflow, which is aimed at programming questions. But since usually the developers are supposed to choose the shard key, my intention was to clarify things for them. Additionally, the data modeling might be influenced by sharding considerations, so – by a small margin – I thought it was ok to answer here. However, dear reviewer, feel free to flag this answer for deletion if you think differently. I am really not sure.