MongoDB 4.2 – Oplog Incremental Backup Failure

mongodbmongodb-4.2

I have scheduled oplog incremental dump using mongodump but it fails very often on majority of the servers. I have a shard cluster with around 200GB of oplog size defined and around 30-40 hours of oplog window.

I am thinking it is because of message "WiredTiger record store oplog truncation" just before the dump starts which is common during every failure. Other times, the dump runs perfectly fine. I have analysed the oplog window and size and it is fine. The oplog start data was way older than the timestamp I am using during oplog dump.

Below are some of the logs :

  1. From mongod.log and my backup script :

2020-07-19T16:28:10.472+0000 I STORAGE  [WT RecordStoreThread: local.oplog.rs] WiredTiger record store oplog truncation finished in: 420ms

2020-07-19T16:28:10.502+0000 E QUERY    [conn74791] Plan executor error during find command: DEAD, stats: { stage: "COLLSCAN", filter: { $and: [ { ts: { $lte: Timestamp(1595176086, 2244) } }, { ts: { $gt: Timestamp(1595173520, 25) } } ] }, nReturned: 0, executionTimeMillisEstimate: 2970, works: 356598, advanced: 0, needTime: 356597, needYield: 0, saveState: 2787, restoreState: 2787, isEOF: 0, invalidates: 0, direction: "forward", docsExamined: 356596 }```

oplog-prd-mon-XYZ-shard-hd03.c.XYZ-dr.internal-20200719.log-2020-07-19T16:28:06.794+0000 reading password from standard input
oplog-prd-mon-XYZ-shard-hd03.c.XYZ-dr.internal-20200719.log-Enter password:
oplog-prd-mon-XYZ-shard-hd03.c.XYZ-dr.internal-20200719.log-2020-07-19T16:28:07.153+0000 writing local.oplog.rs to stdout
oplog-prd-mon-XYZ-shard-hd03.c.XYZ-dr.internal-20200719.log-2020-07-19T16:28:09.795+0000 local.oplog.rs  0
oplog-prd-mon-XYZ-shard-hd03.c.XYZ-dr.internal-20200719.log-2020-07-19T16:28:10.517+0000 local.oplog.rs  0
oplog-prd-mon-XYZ-shard-hd03.c.XYZ-dr.internal-20200719.log:2020-07-19T16:28:10.517+0000 Failed: error writing data for collection `local.oplog.rs` to disk: error reading collection: Executor error during find command :: caused by :: errmsg: "CollectionScan died due to position in capped collection being deleted. Last seen record id: RecordId(6850019249918838732)"

Best Answer

You are right, the oplog truncation is killing your cursor.

There are no indexes on the oplog, so dump has to scan from the very beginning to find the documents that match the query.

In that case of this mongodump cursor, it had examined 356596 documents, but still hadn't found any that matched the query.

The failure message reports the last seen record was "RecordId(6850019249918838732)". That number is a 64-bit value that corresponds to the ts.

You can get the first 32 bits and the corresponding timestamp from bash:

% echo $((6850019249918838732 >> 32))
1594894391

That date is Thu Jul 16 10:13:11 UTC 2020 still over a month prior to the first date of interest.

When those documents were trimmed from the capped collection, the cursor no longer pointed to a valid position, so it was not possible to continue.

When that happens, you'll need to restart the mongodump so it can start scanning from the new beginning of the oplog.