Every once in a while our mongo service crashes after logging multiple messages like this:
Sun Apr 6 01:11:38.648 [conn2149] getmore prod_db.tweets query: { query: {}, $snapshot: true } cursorid:2408233643061247141 ntoreturn:0 exhaust:1 keyUpdates:0 numYields: 25 locks(micros) r:120718 nreturned:901 reslen:4197371 122ms
Sun Apr 6 01:11:38.769 [conn2149] getmore prod_db.tweets query: { query: {}, $snapshot: true } cursorid:2408233643061247141 ntoreturn:0 exhaust:1 keyUpdates:0 numYields: 22 locks(micros) r:73717 nreturned:905 reslen:4196587 1
If I understand correctly, "numYields" means that it has tried "numYields" times to run the query, but yielded. However, I don't know which process might be blocking it, nor why it is crashing. Any idea?
Best Answer
That query is simply a long-running read without any criteria, so it runs against all of the data. The results are fetched in batches (based on your batch size), and each time a batch is used up a `getmore` is issued on the same cursor for the next set of results.

The `numYields` count does not mean the query is being blocked; it means the query yielded its lock when needed. This is usually done to let a write through, and usually when the original query had to page fault to disk to get data; the query then resumes. When querying all data in a collection, this is going to happen often unless all of your data and indexes fit in RAM.

So the query is not being blocked. In fact, the `getmore` operations show that it is progressing over time. Most long-running reads will have a similar profile, especially if you are writing to the database at the same time.

It is also unlikely that this query is the cause of any crash (it's just a read). More likely something else is causing the crash, and you are associating this query with it because it happens to be running when the crash occurs. People often suspect the `serverStatus` command for the same reason, since it is run once a minute by MMS. I would recommend posting the full log output around the crash as a separate question for proper diagnosis.

For what it's worth, with `snapshot` set to true and the fact that it is reading all data, I suspect this is a `mongodump` query (it defaults to using snapshot to avoid duplicates being dumped when data is moved).
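
If you want to confirm from the log that the cursor is progressing rather than stalling, a small script can pull the counters out of the getmore lines. The following is only a minimal sketch in Python, not an official MongoDB tool. The field patterns are inferred from the two sample lines in the question and may not match other server versions, and the names `parse_getmore` and `summarize` are made up for illustration. The second sample line is truncated in the question, so its duration is left as `None` rather than guessed.

```python
import re
from collections import defaultdict

# Sample lines copied from the question. The second one is truncated in the
# original post (it ends in "1" with no "ms"), so it is kept exactly as posted.
LOG_LINES = [
    "Sun Apr 6 01:11:38.648 [conn2149] getmore prod_db.tweets query: "
    "{ query: {}, $snapshot: true } cursorid:2408233643061247141 ntoreturn:0 "
    "exhaust:1 keyUpdates:0 numYields: 25 locks(micros) r:120718 "
    "nreturned:901 reslen:4197371 122ms",
    "Sun Apr 6 01:11:38.769 [conn2149] getmore prod_db.tweets query: "
    "{ query: {}, $snapshot: true } cursorid:2408233643061247141 ntoreturn:0 "
    "exhaust:1 keyUpdates:0 numYields: 22 locks(micros) r:73717 "
    "nreturned:905 reslen:4196587 1",
]

# Assumed field layout, based only on the sample lines above.
FIELD_PATTERNS = {
    "cursor_id": re.compile(r"cursorid:(\d+)"),
    "num_yields": re.compile(r"numYields:\s*(\d+)"),
    "read_lock_micros": re.compile(r"locks\(micros\) r:(\d+)"),
    "nreturned": re.compile(r"nreturned:(\d+)"),
    "reslen": re.compile(r"reslen:(\d+)"),
    "duration_ms": re.compile(r"(\d+)ms\s*$"),
}


def parse_getmore(line):
    """Return a dict of counters for a getmore log line, or None otherwise."""
    if " getmore " not in line:
        return None
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(line)
        # A field that is missing or truncated is recorded as None.
        record[name] = int(match.group(1)) if match else None
    return record


def summarize(lines):
    """Total the batches, documents and bytes returned per cursor id."""
    totals = defaultdict(lambda: {"batches": 0, "docs": 0, "bytes": 0})
    for line in lines:
        record = parse_getmore(line)
        if record is None or record["cursor_id"] is None:
            continue
        entry = totals[record["cursor_id"]]
        entry["batches"] += 1
        entry["docs"] += record["nreturned"] or 0
        entry["bytes"] += record["reslen"] or 0
    return dict(totals)


if __name__ == "__main__":
    for cursor_id, entry in summarize(LOG_LINES).items():
        print(cursor_id, entry)
```

If the same cursor id keeps appearing with a nonzero `nreturned` on each line, the read is moving forward, and the yields are just the cursor giving up its lock along the way.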