We have three MongoDB shards running db version v3.2.10. The primary shard is running at very high CPU usage (~675%), while the other two shard nodes are at ~12% CPU usage.
Here is the output of the top command on the primary shard:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3825494 mongod 20 0 44.4g 37g 10m S 662.5 62.2 14276:02 mongod
2803 root 20 0 4380 84 0 S 0.3 0.0 71:59.57 rngd
Output of the top command on a secondary shard:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2504833 mongod 20 0 38.3g 36g 9m S 10.0 60.9 1686:11 mongod
3891605 qateam 20 0 454m 12m 0 S 2.0 0.0 53:34.66 node_exporte
Output of db.currentOp() on shard1, where CPU is high, showing the operations with the highest secs_running:
1) { "desc" : "rsSync", "threadId" : "140545421088512", "active" : true, "opid" : 76313, "secs_running" : 329997, "microsecs_running" : NumberLong("329997131632"), "op" : "none", "ns" : "local.replset.minvalid", "query" : { }, "numYields" : 0, "locks" : { }, "waitingForLock" : false, "lockStats" : { "Global" : { "acquireCount" : { "r" : NumberLong(72414875), "w" : NumberLong(54311146), "R" : NumberLong(18103716), "W" : NumberLong(18103716) }, "acquireWaitCount" : { "R" : NumberLong(132), "W" : NumberLong(3942872) }, "timeAcquiringMicros" : { "R" : NumberLong(18610043), "W" : NumberLong(1490223084) } }, "Database" : { "acquireCount" : { "r" : NumberLong(6), "w" : NumberLong(1), "W" : NumberLong(54311145) }, "acquireWaitCount" : { "W" : NumberLong(1) }, "timeAcquiringMicros" : { "W" : NumberLong(61) } }, "Collection" : { "acquireCount" : { "r" : NumberLong(5) } }, "Metadata" : { "acquireCount" : { "w" : NumberLong(1) } }, "oplog" : { "acquireCount" : { "r" : NumberLong(1), "w" : NumberLong(1) } } } }
2) { "desc" : "WT RecordStoreThread: local.oplog.rs", "threadId" : "140545378072320", "active" : true, "opid" : 715968754, "secs_running" : 1310, "microsecs_running" : NumberLong(1310975777), "op" : "none", "ns" : "local.oplog.rs", "query" : { }, "numYields" : 0, "locks" : { }, "waitingForLock" : false, "lockStats" : { "Global" : { "acquireCount" : { "r" : NumberLong(1), "w" : NumberLong(1) } }, "Database" : { "acquireCount" : { "w" : NumberLong(1) } }, "oplog" : { "acquireCount" : { "w" : NumberLong(1) } } } }
How can I bring the CPU usage down and keep it within a normal range?
Best Answer
In a normal situation all traffic (reads and writes) goes to the primary node, so it is the busiest node in the replica set. By default, secondaries just replicate changes (updates, inserts, deletes) and do not respond to client queries.
But check your I/O.
iostat -mx 1
What are %iowait and %util? The iotop program shows how much you are actually reading from and writing to disk. Do you know how many IOPS your disk system can serve? MongoDB is very IOPS-centric: if mongod cannot get enough IOPS, it is going to be slow. Secondaries in particular can start lagging if they cannot write to disk fast enough. You can see that from the primary with the rs.printSlaveReplicationInfo() command. Secondaries should stay under 2 seconds behind.
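If you would rather check the lag programmatically than read it off the printed report, here is a rough sketch of the same comparison. It assumes a document shaped like the output of rs.status(), where each member has name, stateStr and optimeDate fields. The function name secondaryLagSeconds and the sample hosts and timestamps are made up for illustration; it is written in plain ES5 so it can also be pasted into the mongo shell.

```javascript
// Sketch: seconds each SECONDARY is behind the PRIMARY, computed from an
// rs.status()-shaped document. Not an official MongoDB helper.
function secondaryLagSeconds(status) {
  var primary = null;
  status.members.forEach(function (m) {
    if (m.stateStr === "PRIMARY") { primary = m; }
  });
  if (primary === null) {
    throw new Error("no PRIMARY member in status document");
  }
  var lag = {};
  status.members.forEach(function (m) {
    if (m.stateStr === "SECONDARY") {
      // optimeDate is the time of the last oplog entry the member has applied.
      lag[m.name] = (primary.optimeDate.getTime() - m.optimeDate.getTime()) / 1000;
    }
  });
  return lag;
}

// Hypothetical sample data: host names and times are invented.
var sampleStatus = {
  members: [
    { name: "shard1-a:27017", stateStr: "PRIMARY",   optimeDate: new Date("2017-01-01T00:00:10Z") },
    { name: "shard1-b:27017", stateStr: "SECONDARY", optimeDate: new Date("2017-01-01T00:00:09Z") },
    { name: "shard1-c:27017", stateStr: "SECONDARY", optimeDate: new Date("2017-01-01T00:00:05Z") }
  ]
};
var lagBySecondary = secondaryLagSeconds(sampleStatus);
```

In the mongo shell you would call it as printjson(secondaryLagSeconds(rs.status())) on a replica set member. With the sample data above, the second host would be 1 second behind and the third 5 seconds behind, so the third would be over the 2 second rule of thumb.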