I've been playing with a large dataset consisting of a few tens of billions of rows of small ASCII strings. I want to count the number of occurrences of each string in the dataset, and I suspect the number of unique strings is about 1 billion. I've been upserting them into a collection in a MongoDB 4.0 database on three 32GB RAM machines in a replica set, with the WiredTiger storage engine. Things go pretty well until I hit the magical number of 200.000.000 rows. After that, the insert speed grinds to a halt. I bulk upsert a chunk of 5000 strings at a time, and some operations take 1 second, but once in a while an operation takes 40 seconds or more. Replication lag also starts to shoot up.
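The question does not show the upsert code itself. As a rough sketch only, a chunked counting upsert of this kind could be built as below, assuming each string is stored as the document's _id and the tally lives in a hypothetical count field. The helper name buildCountingUpserts and the collection name strings are also made up for illustration. Duplicates within a chunk are merged first, so each distinct string produces a single $inc.

```javascript
// Hypothetical helper: turn one chunk of strings into bulkWrite operations,
// one upsert per distinct string, incrementing an assumed "count" field.
function buildCountingUpserts(strings) {
  // Pre-aggregate inside the chunk so repeated strings cost one write, not many.
  const counts = new Map();
  for (const s of strings) {
    counts.set(s, (counts.get(s) || 0) + 1);
  }

  const ops = [];
  for (const [s, n] of counts) {
    ops.push({
      updateOne: {
        filter: { _id: s },             // the string itself is the key
        update: { $inc: { count: n } }, // add this chunk's occurrences
        upsert: true,                   // create the document on first sight
      },
    });
  }
  return ops;
}

// Possible use in the mongo shell (collection name is an assumption):
//   db.strings.bulkWrite(buildCountingUpserts(chunk), { ordered: false });
```

Every upsert on a new string has to touch the _id index, so with random keys the working set of that index grows along with the collection.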
Examining db.stats(), I see that the only index, _id, takes about 8GB of memory; storage size is about 6GB; the WiredTiger cache size default is half of RAM, so about 15GB. All of these add up to almost 29GB. I'm guessing that the oplog also takes a few GB as well.
Am I running out of RAM? I was expecting to be able to load the whole dataset in one shot. I will probably convert the 3-member replica set to 3 shards, as the data is mostly perishable, and I will reduce the WiredTiger cache size parameter, but I am curious whether I am right about the RAM and the number of documents.
Best Answer
As the MongoDB documentation explains, with WiredTiger, MongoDB utilizes both the WiredTiger internal cache and the filesystem cache.
Starting in 3.4, the WiredTiger internal cache, by default, will use the larger of either 50% of (RAM - 1 GB) or 256 MB.
Via the filesystem cache, MongoDB automatically uses all free memory that is not used by the WiredTiger cache or by other processes.
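To make the default concrete, here is a small sketch of the rule from the MongoDB manual (the larger of 50% of (RAM - 1 GB) or 256 MB). The function name defaultWiredTigerCacheGB is made up for illustration, and the result is only an estimate of what mongod picks on a given machine.

```javascript
// Hypothetical helper: estimate the default WiredTiger internal cache size
// using the documented rule: max(50% of (RAM - 1 GB), 256 MB).
function defaultWiredTigerCacheGB(ramGB) {
  const halfOfRamMinusOne = 0.5 * (ramGB - 1);
  const floorGB = 0.25; // 256 MB expressed in GB
  return Math.max(halfOfRamMinusOne, floorGB);
}

console.log(defaultWiredTigerCacheGB(32)); // 15.5
console.log(defaultWiredTigerCacheGB(1));  // 0.25
```

On the 32GB machines from the question this gives 15.5 GB for the internal cache, leaving the rest for the filesystem cache and for other processes.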
Note: The storage.wiredTiger.engineConfig.cacheSizeGB setting limits the size of the WiredTiger internal cache. The operating system will use the available free memory for filesystem cache, which allows the compressed MongoDB data files to stay in memory. In addition, the operating system will use any free RAM to buffer file system blocks and file system cache. To accommodate the additional consumers of RAM, you may have to decrease the WiredTiger internal cache size.
The default WiredTiger internal cache size value assumes that there is a single mongod instance per machine. If a single machine contains multiple MongoDB instances, then you should decrease the setting to accommodate the other mongod instances.
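The cache size is normally set at startup, through storage.wiredTiger.engineConfig.cacheSizeGB in the config file or the --wiredTigerCacheSizeGB option. As a hedged sketch, it can also be changed on a running mongod through the wiredTigerEngineRuntimeConfig server parameter. The MongoDB manual advises caution with that parameter, so treat this as an experiment aid rather than a recommended procedure. The helper name buildCacheResizeCommand is invented for illustration.

```javascript
// Hypothetical helper: build the admin command document that asks a running
// mongod to resize the WiredTiger internal cache to a whole number of GB.
function buildCacheResizeCommand(sizeGB) {
  if (!Number.isInteger(sizeGB) || sizeGB < 1) {
    throw new RangeError("sizeGB must be a positive integer");
  }
  return {
    setParameter: 1,
    wiredTigerEngineRuntimeConfig: "cache_size=" + sizeGB + "G",
  };
}

// Possible use in the mongo shell, run against each mongod separately:
//   db.adminCommand(buildCacheResizeCommand(8));
```

A change made this way does not persist across restarts, so the config file still needs the matching value.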
To learn more, run the serverStatus command, for example:
db.runCommand( { serverStatus: 1 } )
Monitoring applications can run this command at a regular interval to collect statistics about the instance.
The following example includes all repl information in the output:
db.runCommand( { serverStatus: 1, repl: 1 } )
The output fields vary depending on the version of MongoDB, underlying operating system platform, the storage engine, and the kind of node, including mongos, mongod or replica set member.
For the serverStatus output specific to the version of your MongoDB, refer to the appropriate version of the MongoDB Manual.
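For the memory question above, the wiredTiger.cache section of the serverStatus output is the relevant part. A minimal sketch of pulling out a few of its counters follows. The function name summarizeWiredTigerCache is made up, and the three field names are the ones WiredTiger reports in the versions I am aware of, so check them against the output of your own server.

```javascript
// Hypothetical helper: summarize WiredTiger cache pressure from a
// serverStatus result document.
function summarizeWiredTigerCache(status) {
  const cache = status.wiredTiger.cache;
  const GiB = 1024 * 1024 * 1024;

  // Number() is applied defensively in case the shell hands back wrapped longs.
  const maxBytes = Number(cache["maximum bytes configured"]);
  const usedBytes = Number(cache["bytes currently in the cache"]);
  const dirtyBytes = Number(cache["tracked dirty bytes in the cache"]);

  return {
    maxGB: maxBytes / GiB,
    usedGB: usedBytes / GiB,
    dirtyGB: dirtyBytes / GiB,
    fillRatio: usedBytes / maxBytes,   // share of the configured cache in use
    dirtyRatio: dirtyBytes / maxBytes, // share of the cache holding unwritten data
  };
}

// Possible use in the mongo shell:
//   printjson(summarizeWiredTigerCache(db.serverStatus()));
```

A fill ratio that stays close to 1 together with a high dirty ratio while upserts stall would suggest that the working set no longer fits in the cache and operations are waiting on eviction.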