MongoDB crashes with out-of-memory or is killed by the oom-killer

Tags: crash, google-cloud-platform, mongodb, mongodb-3.6, sharding

A two-shard MongoDB database regularly crashes with an out-of-memory error or is killed by the oom-killer. The system runs on GCE (Debian 9.4) with MongoDB v3.6.5, the WiredTiger storage engine, and no swap (as is the practice on GCE). The servers are n1-highmem-4 (4 vCPUs, 26 GB memory). Each server runs only mongod, with no other services; the mongos instances run on separate servers.

The process usually exits or crashes about once a day. When the mongod process is killed by the oom-killer, this shows up in the logs:

Jun 15 14:45:17 server4 kernel: [1731430.432189] Out of memory: Kill process 13130 (mongod) score 980 or sacrifice child
Jun 15 14:45:17 server4 kernel: [1731430.441717] Killed process 13130 (mongod) total-vm:28280536kB, anon-rss:26174876kB, file-rss:0kB, shmem-rss:0kB
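To put the numbers in that kill report into perspective, here is a minimal sketch that converts the kB fields to GiB. It only parses the line pasted above; the helper name `parse_oom_kill` is made up for illustration.

```python
import re

# Kernel log line copied from the oom-killer report above.
LINE = (
    "Jun 15 14:45:17 server4 kernel: [1731430.441717] Killed process 13130 (mongod) "
    "total-vm:28280536kB, anon-rss:26174876kB, file-rss:0kB, shmem-rss:0kB"
)


def parse_oom_kill(line):
    """Return the size fields of a 'Killed process' line, converted from kB to GiB."""
    fields = re.findall(r"([\w-]+):(\d+)kB", line)
    return {name: int(kb) / (1024 * 1024) for name, kb in fields}


sizes = parse_oom_kill(LINE)
for name, gib in sizes.items():
    print(f"{name}: {gib:.2f} GiB")
```

The anon-rss value comes out at roughly 25 GiB, so at the moment of the kill mongod alone held nearly all of the machine's 26 GB in anonymous memory.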

Sometimes mongod exits on its own and leaves this in mongod.log:

2018-06-15T02:14:32.456+0200 F -        [rsSync] out of memory.

 0x55cbc8535751 0x55cbc8534d84 0x55cbc8623b4b 0x55cbc86c665c 0x55cbc70fccff 0x55cbc70f8b02 0x55cbc707b3f1 0x55cbc86449b0 0x7fbbf3507494 0x7fbbf3249acf
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"55CBC6305000","o":"2230751","s":"_ZN5mongo15printStackTraceERSo"},{"b":"55CBC6305000","o":"222FD84","s":"_ZN5mongo29reportOutOfMemoryErrorAndExitEv"},{"b":"55CBC6305000","o":"231EB4B"},{"b":"55CBC6305000","o":"23C165C","s":"_Znam"},{"b":"55CBC6305000","o":"DF7CFF","s":"_ZN5mongo4repl8SyncTail7OpQueueC1Ev"},{"b":"55CBC6305000","o":"DF3B02","s":"_ZN5mongo4repl8SyncTail16oplogApplicationEPNS0_22ReplicationCoordinatorE"},{"b":"55CBC6305000","o":"D763F1","s":"_ZN5mongo4repl10RSDataSync4_runEv"},{"b":"55CBC6305000","o":"233F9B0"},{"b":"7FBBF3500000","o":"7494"},{"b":"7FBBF3161000","o":"E8ACF","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.6.5", "gitVersion" : "a20ecd3e3a174162052ff99913bc2ca9a839d618", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.9.0-6-amd64", "version" : "#1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07)", "machine" : "x86_64" }, "somap" : [ { "b" : "55CBC6305000", "elfType" : 3, "buildId" : "7D4592BDFAA6C15459D2319DEAB7F10E9EB4E7D7" }, { "b" : "7FFC48D98000", "path" : "linux-vdso.so.1", "elfType" : 3, "buildId" : "A3207CC9FE1CAA3374AE7061AA5C3C5619B8A0E5" }, { "b" : "7FBBF4743000", "path" : "/lib/x86_64-linux-gnu/libresolv.so.2", "elfType" : 3, "buildId" : "713D47D5F599289C0A91ADE8F0122B2B4AA78B2E" }, { "b" : "7FBBF42B0000", "path" : "/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1", "elfType" : 3, "buildId" : "2CFE882A331D7857E9CE1B5DE3255E6DA76EF899" }, { "b" : "7FBBF4044000", "path" : "/usr/lib/x86_64-linux-gnu/libssl.so.1.1", "elfType" : 3, "buildId" : "E2AA3B39763D943F56B3BD05C8E36E639BA95E12" }, { "b" : "7FBBF3E40000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "B895F0831F623C5F23603401D4069F9F94C24761" }, { "b" : "7FBBF3C38000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "5D83E0642E645026DBB11F89F7DF7106BD821495" }, { "b" : "7FBBF3934000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : 
"1B95E3A8B8788B07E4F59EE69B1877F9DEB42033" }, { "b" : "7FBBF371D000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "51AD5FD294CD6C813BED40717347A53434B80B7A" }, { "b" : "7FBBF3500000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "4285CD3158DDE596765C747AE210AB6CBD258B22" }, { "b" : "7FBBF3161000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "AA889E26A70F98FA8D230D088F7CC5BF43573163" }, { "b" : "7FBBF495A000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "263F909DBE11A66F7C6233E3FF0521148D9F8370" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x55cbc8535751]
 mongod(_ZN5mongo29reportOutOfMemoryErrorAndExitEv+0x84) [0x55cbc8534d84]
 mongod(+0x231EB4B) [0x55cbc8623b4b]
 mongod(_Znam+0x21C) [0x55cbc86c665c]
 mongod(_ZN5mongo4repl8SyncTail7OpQueueC1Ev+0x7F) [0x55cbc70fccff]
 mongod(_ZN5mongo4repl8SyncTail16oplogApplicationEPNS0_22ReplicationCoordinatorE+0x402) [0x55cbc70f8b02]
 mongod(_ZN5mongo4repl10RSDataSync4_runEv+0x111) [0x55cbc707b3f1]
 mongod(+0x233F9B0) [0x55cbc86449b0]
 libpthread.so.0(+0x7494) [0x7fbbf3507494]
 libc.so.6(clone+0x3F) [0x7fbbf3249acf]
-----  END BACKTRACE  -----

We tried lowering the cacheSizeGB parameter to 10 GB:

  cacheSizeGB: 10.0

The crashes still happen.
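For context, a rough sketch of the arithmetic behind this setting. It assumes the documented default for the WiredTiger internal cache (the larger of 50% of (RAM - 1 GB) and 256 MB) and the machine size from the question; the function name is hypothetical. Note that cacheSizeGB only bounds the WiredTiger cache, not memory that mongod allocates outside of it, which may be why lowering it does not help here.

```python
def default_wiredtiger_cache_gb(ram_gb):
    """Default WiredTiger internal cache: the larger of 50% of (RAM - 1 GB) and 256 MB."""
    return max(0.5 * (ram_gb - 1.0), 0.25)


RAM_GB = 26.0         # n1-highmem-4, as stated in the question
CONFIGURED_GB = 10.0  # cacheSizeGB value from the question

default_gb = default_wiredtiger_cache_gb(RAM_GB)
headroom_gb = RAM_GB - CONFIGURED_GB

print(f"default cache size:      {default_gb:.1f} GB")
print(f"configured cache size:   {CONFIGURED_GB:.1f} GB")
print(f"RAM outside the cache:   {headroom_gb:.1f} GB")
```

With 10 GB configured, about 16 GB is left for everything outside the cache, yet the process was still killed at around 25 GiB resident, which suggests the growth happens outside the WiredTiger cache.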

It is worth mentioning that a chunk migration is underway, and the mongod instances that crash belong to the shard that receives the chunks.

Which tuning options can we use to avoid such crashes?

UPDATE: we added a small 1 GB swap, but the out-of-memory crashes still happen.

Best Answer

It turned out that we had a long-running daily query, and the chunks that were moved were retained in memory because the query's cursor was still using them. With every chunk moved, the memory holding that chunk's data was retained, until eventually all memory was consumed.

After we removed this long-running query, there were no more crashes.
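If you need to hunt for such a query yourself, one possible approach is to filter the `inprog` array returned by `db.currentOp()` by its `secs_running` field. The sketch below works on plain dictionaries; the sample documents, the namespace `mydb.events`, and the 600-second threshold are made up for illustration and are not from our system.

```python
def long_running_ops(inprog, min_secs):
    """Return active operations running for at least min_secs seconds, longest first."""
    hits = [
        op for op in inprog
        if op.get("active") and op.get("secs_running", 0) >= min_secs
    ]
    return sorted(hits, key=lambda op: op["secs_running"], reverse=True)


# Hypothetical sample shaped like the "inprog" array of db.currentOp().
SAMPLE = [
    {"opid": 101, "active": True, "op": "getmore", "ns": "mydb.events", "secs_running": 7200},
    {"opid": 102, "active": True, "op": "insert", "ns": "mydb.events", "secs_running": 0},
    {"opid": 103, "active": False, "op": "none", "ns": ""},
]

for op in long_running_ops(SAMPLE, min_secs=600):
    print(op["opid"], op["op"], op["ns"], op["secs_running"])
```

Anything that shows up here during a chunk migration is a candidate for the kind of long-lived cursor described above.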