MongoDB – WiredTiger panic when attempting to rename .turtle leads to loss of all data

mongodb, mongodb-3.2, wiredtiger

We've seen this twice this week on separate servers (one staging, one production).

2017-10-19T12:50:37.525-0400 I ACCESS   [conn266] Successfully authenticated as principal ********* on admin
2017-10-19T13:00:42.782-0400 E STORAGE  [thread2] WiredTiger (-28817) [1508432442:782769][1520:8790690042448], file:WiredTiger.wt, WT_SESSION.checkpoint: c:\mongo\data\WiredTiger.turtle.set to c:\mongo\data\WiredTiger.turtle: file-rename: rename: Cannot create a file when that file already exists.
2017-10-19T13:00:42.784-0400 E STORAGE  [thread2] WiredTiger (-28817) [1508432442:784770][1520:8790690042448], checkpoint-server: checkpoint server error: Cannot create a file when that file already exists.
2017-10-19T13:00:42.785-0400 E STORAGE  [thread2] WiredTiger (-31804) [1508432442:785770][1520:8790690042448], checkpoint-server: the process must exit and restart: WT_PANIC: WiredTiger library panic
2017-10-19T13:00:42.785-0400 I -        [thread2] Fatal Assertion 28558
2017-10-19T13:00:42.785-0400 I -        [thread2] 

***aborting after fassert() failure

2017-10-19T13:00:42.805-0400 I -        [conn259] Fatal Assertion 28559
2017-10-19T13:00:42.806-0400 I -        [conn259] 

***aborting after fassert() failure

The result is that the admin database is wiped out. To recover, we need to disable authentication and recreate all users. The original data files still appear to contain data, but it is no longer visible from the shell commands.

Both servers have been running fine independently for months. They run the same version of MongoDB but very different versions of the software that interacts with it.

I suppose I have two questions:

1) What might cause this particular failure?
c:\mongo\data\WiredTiger.turtle apparently shouldn't exist at this point, but it does; what could cause that? The servers run extensive security software, including Bit9 and anti-virus solutions. I suggested a scheduled defrag job (they're running SSDs, so defrag should never run, but hey…), but they can't find any evidence that a defrag ran on either server.

2) What might cause both servers to experience the same crash a day or two apart?
The only thing the two systems have in common is MongoDB. As an aside, we're going to suggest journaling and replication (their production environment is still in a test rollout, so the downtime and data loss were acceptable this time).
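For anyone following the same journaling-and-replication suggestion: both can be enabled in mongod.conf. A minimal sketch using MongoDB's standard YAML config options; the dbPath and the replica set name `rs0` are placeholder values, not taken from the servers above:

```yaml
storage:
  dbPath: "c:\\mongo\\data"
  journal:
    enabled: true          # write-ahead journal so a crash can be recovered
replication:
  replSetName: rs0         # placeholder name; all members must use the same one
```

After restarting with this config, the replica set still has to be initiated once from the shell (`rs.initiate()`) and the remaining members added.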

Best Answer

Which version are you using? This is a known bug, which has been fixed in versions 3.2.13, 3.4.4, and 3.5.6 onward.
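The log excerpt itself shows the mechanism: at checkpoint time WiredTiger writes the new metadata to WiredTiger.turtle.set and then renames it over WiredTiger.turtle. On Windows a plain rename fails when the destination already exists (the "Cannot create a file when that file already exists" error above), unlike POSIX rename, which replaces the target atomically. A minimal Python sketch of the replace-over-existing pattern; `atomic_update` and the file names are illustrative, not WiredTiger's actual code:

```python
import os

def atomic_update(path: str, data: bytes) -> None:
    """Update a file so readers see either the old or the new contents."""
    # Write the new contents to a sibling file first
    # (analogous to WiredTiger.turtle.set).
    tmp = path + ".set"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # os.replace overwrites an existing destination on every platform.
    # os.rename would raise FileExistsError on Windows when the target
    # exists, which is the failure mode shown in the log above.
    os.replace(tmp, path)
```

With this pattern a leftover destination file is harmless: the replace simply overwrites it on the next update.

```python
atomic_update("turtle_demo.txt", b"version 1")
atomic_update("turtle_demo.txt", b"version 2")  # destination exists; still succeeds
```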
