We are running MongoDB with replica sets on AWS.
Setup Details:
1 primary node: r3.8xlarge
4 secondary nodes: r5.xlarge
Mongo version: 3.0.8 (WiredTiger)
Database size: 358GB
We have configured MongoDB Cloud monitoring for this setup. Here are a few stats from it:
Primary network: 350 MB/s (in: 40 MB/s, out: 310 MB/s, requests: ~5.32K/s)
1.1K connections on average on the primary node
Tickets available on primary: reads: 125, writes: 100
Queues: total: 15-20, read: 0-1, write: 15-20
Other Points:
CPU and memory usage on the instance are well within limits.
We are using an 800 GB gp2 EBS volume (2,400 IOPS) and consuming around 2,000 IOPS. The burst balance is still near full capacity, so that is not exhausted either.
The primary node is an r3.8xlarge, so it has 10 Gigabit networking.
Ulimits are set as follows:
Limit                     Soft Limit    Hard Limit    Units
Max cpu time              unlimited     unlimited     seconds
Max file size             unlimited     unlimited     bytes
Max data size             unlimited     unlimited     bytes
Max stack size            8388608       unlimited     bytes
Max core file size        0             unlimited     bytes
Max resident set          unlimited     unlimited     bytes
Max processes             64000         64000         processes
Max open files            64000         64000         files
Max locked memory         65536         65536         bytes
Max address space         unlimited     unlimited     bytes
Max file locks            unlimited     unlimited     locks
Max pending signals       1967994       1967994       signals
Max msgqueue size         819200        819200        bytes
Max nice priority         0             0
Max realtime priority     0             0
Max realtime timeout      unlimited     unlimited     us
Problem:
We access MongoDB from a Java application, and many of our queries are taking a considerable amount of time (2-4 seconds).
We enabled profiling and listed the slowest queries. We found that a sample update query acquired intent locks about 17,000 times and had to yield just as many times.
We don't understand: if it is an intent lock, why does it need to yield so many times? Doesn't an intent lock allow other operations to proceed?
If the query has to yield so many times, where is the advantage of document-level locking?
We have also seen some of our queries timing out (probably waiting too long for a lock and eventually dying).
Could you please guide us on how to debug this problem and improve the performance of our MongoDB deployment?
Here's the profiler output of a sample query:
{
    "op" : "update",
    "ns" : "backend.Scores",
    "query" : {
        "channel.id" : "hkxj",
        "metric" : "YVR",
        "date" : ISODate("2018-11-20T10:00:00Z")
    },
    "updateobj" : {
        "$set" : {
            "channel" : {
                "id" : "hkxj",
                "failures" : 39,
                "items" : [
                    {
                        "_id" : ObjectId("5bf3e434800075956f1"),
                        "image" : "http://images.abcd.com/something/wscoob.jpg",
                        "b_time" : ISODate("2018-11-26T19:24:00Z"),
                        "title" : "What's New ",
                        "id" : "fq969"
                    },
                    {
                        "_id" : ObjectId("5bf3e43f800075956f0"),
                        "image" : "http://images.abcd.com/something/w4citv2.jpg",
                        "broadcast_time" : ISODate("2018-11-26T20:24:00Z"),
                        "title" : "Let's End This",
                        "id" : "fsfgd"
                    }
                ]
            },
            "metric" : "YVR",
            "date" : ISODate("2018-11-20T10:00:00Z")
        },
        "$setOnInsert" : {
            "__v" : 0
        }
    },
    "nscanned" : 0,
    "nscannedObjects" : 2209900,
    "nMatched" : 1,
    "nModified" : 1,
    "keyUpdates" : 0,
    "writeConflicts" : 0,
    "numYield" : 17264,
    "locks" : {
        "Global" : {
            "acquireCount" : {
                "r" : NumberLong(17266),
                "w" : NumberLong(17266)
            }
        },
        "Database" : {
            "acquireCount" : {
                "w" : NumberLong(17266)
            }
        },
        "Collection" : {
            "acquireCount" : {
                "w" : NumberLong(17265)
            }
        },
        "oplog" : {
            "acquireCount" : {
                "w" : NumberLong(1)
            }
        }
    },
    "millis" : 3864,
    "execStats" : { },
    "ts" : ISODate("2018-11-20T10:40:25.104Z"),
    "client" : "172.2.3.52",
    "allUsers" : [ ],
    "user" : ""
}
Best Answer
The provided output indicates a query that required a collection scan of 2,209,900 documents ("nscannedObjects") to update a single document.
Since a collection scan is a long-running operation, it will yield multiple times to allow other operations to interleave. See: Does a read or write operation ever yield the lock? in the MongoDB Concurrency FAQ.
You can significantly improve this query by adding a supporting index on { 'channel.id': 1, metric: 1, date: 1 }. Use the query explain option (i.e. db.Scores.find({..}).explain(true)) to see candidate indexes and their usage.
Note: to minimise impact in your production environment you'll want to do a background or rolling index build when adding this index. The MongoDB documentation includes more information on procedures and behaviour: Index Build Operations on a Populated Collection.
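As a sketch (collection and field names taken from the profiled query above; shell syntax for MongoDB 3.0, run against a live deployment), the index build and verification could look like:

```javascript
// Build the supporting index in the background so it does not block
// other operations on the primary (MongoDB 3.0 shell syntax).
db.Scores.createIndex(
    { "channel.id": 1, "metric": 1, "date": 1 },
    { background: true }
)

// Verify that the update's query shape now uses an index scan (IXSCAN)
// rather than a collection scan (COLLSCAN).
db.Scores.find({
    "channel.id": "hkxj",
    "metric": "YVR",
    "date": ISODate("2018-11-20T10:00:00Z")
}).explain(true)
```

A background build is slower than a foreground build but avoids blocking the database; on a replica set you can alternatively perform a rolling build, as the linked documentation describes.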
If your deployment is already stressed, I would caution against enabling the Database Profiler as this will add additional write load and only capture a limited sample of queries (by default profiling uses a 1MB capped collection for each database).
Since slow query information is already available in the MongoDB server logs, I would instead take a log-based approach. Some general tips can be found on ServerFault: Understanding IXSCAN and COLLSCAN in MongoDB logs.
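If you'd rather not tail the log file on the server itself, the same recent log lines can be pulled from a running mongod through the shell; a sketch:

```javascript
// getLog returns the most recent lines from mongod's in-memory log buffer.
// Slow operations (over the 100 ms slowms default) are logged here
// automatically, with no profiler overhead.
var recent = db.adminCommand({ getLog: "global" })

// Flag entries that performed a collection scan.
recent.log.filter(function (line) {
    return line.indexOf("COLLSCAN") !== -1
})
```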
Significant time acquiring locks suggests either contention or a resource constraint (for example, I/O). If you have optimised all of your slow queries, you'll have to look into more general metrics and what is logged during periods of notable slowness. Your question description suggests that you aren't hitting resource limits yet, so logs and serverStatus metrics are a more likely focus.
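For example, the ticket and queue numbers you quoted from monitoring come straight out of serverStatus, so you can spot-check them in the shell during a slow period; a sketch:

```javascript
var status = db.serverStatus()

// WiredTiger read/write tickets: "out" vs. "available"
// (your monitoring showed 125 read / 100 write tickets available).
status.wiredTiger.concurrentTransactions

// Operations queued waiting on locks (your "Queues: 15-20" stat).
status.globalLock.currentQueue
```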
Overall performance tuning is a much bigger topic than can be covered in a single answer; however, some suggested starting points are: