MongoDB – 32GB RAM needed to work with 200.000.000 rows in MongoDB

hardware, mongodb, mongodb-4.0

I've been playing with a large dataset consisting of a few tens of billions of rows of small ASCII strings. I want to count the number of occurrences of each string in the dataset, and I suspect the number of unique strings is about 1 billion. I've been upserting them into a collection in a MongoDB 4.0 database on three 32GB RAM machines in a replica set, with the WiredTiger storage engine. Things go pretty well until I hit the magical number of 200.000.000 rows. After that, the insert speed grinds to a halt. I bulk upsert a chunk of 5000 strings at a time; some operations take 1 second, but once in a while an operation takes 40 seconds or more. Replication lag also starts to shoot up.

Examining db.stats(), I see that the only index, _id, takes about 8GB of memory; storage size is about 6GB; the WiredTiger cache size defaults to half of RAM, so about 15GB. All of these add up to almost 29GB. I'm guessing that the oplog takes a few GB as well.

Am I running out of RAM? I was expecting to be able to load the whole dataset in one shot. I will probably convert the 3-member replica set to 3 shards, as the data is mostly perishable, and I will reduce the WiredTiger cache size parameter, but I am curious whether I am right about the RAM and the number of documents.

Best Answer

As the MongoDB documentation states, with WiredTiger, MongoDB utilizes both the WiredTiger internal cache and the filesystem cache.

Starting in 3.4, the WiredTiger internal cache, by default, will use the larger of either:

50% of (RAM - 1 GB), or
256 MB.

For example, on a system with a total of 4GB of RAM the WiredTiger cache will use 1.5GB of RAM (0.5 * (4 GB - 1 GB) = 1.5 GB). Conversely, a system with a total of 1.25 GB of RAM will allocate 256 MB to the WiredTiger cache because that is more than half of the total RAM minus one gigabyte (0.5 * (1.25 GB - 1 GB) = 128 MB < 256 MB).
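The rule above can be written as a small function, which makes it easy to check the figure for the 32GB machines in the question. This is a minimal sketch in plain JavaScript (it runs under Node.js); the function name defaultWiredTigerCacheGB is made up for this illustration and is not part of MongoDB.

```javascript
// Default WiredTiger internal cache size for MongoDB 3.4+, following the
// rule quoted above: the larger of 50% of (RAM - 1 GB) and 256 MB.
// All sizes are in GB. The function name is hypothetical.
function defaultWiredTigerCacheGB(totalRamGB) {
  const halfOfRamMinusOne = 0.5 * (totalRamGB - 1);
  const minimumGB = 0.25; // 256 MB expressed in GB
  return Math.max(halfOfRamMinusOne, minimumGB);
}

console.log(defaultWiredTigerCacheGB(4));    // 1.5
console.log(defaultWiredTigerCacheGB(1.25)); // 0.25
console.log(defaultWiredTigerCacheGB(32));   // 15.5
```

So on a 32GB machine the default internal cache is 15.5GB, which matches the "about 15GB" estimate in the question.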

Via the filesystem cache, MongoDB automatically uses all free memory that is not used by the WiredTiger cache or by other processes.

Note: The storage.wiredTiger.engineConfig.cacheSizeGB limits the size of the WiredTiger internal cache. The operating system will use the available free memory for filesystem cache, which allows the compressed MongoDB data files to stay in memory. In addition, the operating system will use any free RAM to buffer file system blocks and file system cache.

To accommodate the additional consumers of RAM, you may have to decrease WiredTiger internal cache size.

The default WiredTiger internal cache size value assumes that there is a single mongod instance per machine. If a single machine contains multiple MongoDB instances, then you should decrease the setting to accommodate the other mongod instances.
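If you want to try a smaller cache without restarting mongod, one option is the wiredTigerEngineRuntimeConfig server parameter. The sketch below only builds the command document; the helper name buildCacheResizeCommand and the 8GB value are assumptions chosen for illustration. A change made this way is not persistent, so the permanent value still belongs in storage.wiredTiger.engineConfig.cacheSizeGB, and since MongoDB's documentation advises caution with this parameter it is worth trying on a test instance first.

```javascript
// Builds the admin command that resizes the WiredTiger internal cache on a
// running mongod. The helper name and its argument are hypothetical; only
// whole gigabytes are accepted to keep the sketch simple.
function buildCacheResizeCommand(cacheSizeGB) {
  if (!Number.isInteger(cacheSizeGB) || cacheSizeGB <= 0) {
    throw new RangeError("cacheSizeGB must be a positive whole number");
  }
  return {
    setParameter: 1,
    wiredTigerEngineRuntimeConfig: "cache_size=" + cacheSizeGB + "G"
  };
}

// In the mongo shell, connected to the mongod you want to change:
//   db.adminCommand(buildCacheResizeCommand(8))
```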

To learn more about the server's status, run the serverStatus command, for example:

db.runCommand( { serverStatus: 1 } )

Monitoring applications can run this command at a regular interval to collect statistics about the instance.

Note: The value (i.e. 1 above) does not affect the operation of the command.

The following example includes all repl information in the output:

db.runCommand( { serverStatus: 1,  repl: 1 } )

The output fields vary depending on the version of MongoDB, underlying operating system platform, the storage engine, and the kind of node, including mongos, mongod or replica set member.

For the serverStatus output specific to the version of your MongoDB, refer to the appropriate version of the MongoDB Manual.
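For the RAM question specifically, the wiredTiger.cache section of the serverStatus output is the part to look at. The following is a rough sketch that summarizes how full the cache is; the helper name wiredTigerCacheUsage is made up for this example, and it assumes the three cache fields it reads are present in your version's output.

```javascript
// Summarizes WiredTiger cache usage from a serverStatus document.
// The helper name is hypothetical. Number() is applied in case the shell
// returns 64-bit values as wrapper objects rather than plain numbers.
function wiredTigerCacheUsage(status) {
  const cache = status.wiredTiger.cache;
  const bytesPerGB = 1024 * 1024 * 1024;
  const used = Number(cache["bytes currently in the cache"]);
  const max = Number(cache["maximum bytes configured"]);
  const dirty = Number(cache["tracked dirty bytes in the cache"]);
  return {
    usedGB: used / bytesPerGB,
    maxGB: max / bytesPerGB,
    fillRatio: used / max,   // share of the configured cache in use
    dirtyRatio: dirty / max  // share holding modified, unwritten data
  };
}

// In the mongo shell:
//   printjson(wiredTigerCacheUsage(db.serverStatus()))
```

A fill ratio that stays close to 1 while upserts slow down would suggest the working set (here, mostly the _id index) no longer fits in the cache.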
