I am writing Python code that parses a huge (17 GB) XML document (a Wikipedia dump) and inserts each page title, together with the list of links present on that page, into a MongoDB collection. I insert with `links.insert_one(data)`, where `data` is a dictionary with the following structure:
```python
data = {
    'document': self.title,  # string: the title of the Wikipedia page
    'links': self.links      # list of strings: all hyperlinked words on the page
}
<db_name>.insert_one(data)
```
I repeat this process for every page (around 19 million pages).
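The per-page insertion described above can be sketched as follows. The `pages` generator, the document contents, and the batch size are illustrative, and the actual `insert_many` call is left as a comment so the sketch runs without a MongoDB server; batching inserts does not change the server's cache behavior, but it does cut down on client round trips:

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of up to `size` documents, suitable for a bulk insert."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

# Illustrative stand-in for the parsed pages (title plus list of links).
pages = (
    {'document': f'Page {i}', 'links': [f'Link {i}']}
    for i in range(5)
)

for batch in batched(pages, 2):
    # With pymongo this would be: links.insert_many(batch, ordered=False)
    print([d['document'] for d in batch])
```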
When I monitor RAM usage (using htop on Ubuntu), I see that the memory used by MongoDB keeps increasing until all of it is consumed, eventually crashing the system.
I have also observed that if I comment out the insertion statement, the code runs fine. So I presume the problem lies in my MongoDB insertion process rather than in the Python logic itself.
Any help would be really appreciated.
Best Answer
You can cap the WiredTiger cache size (see the hardware considerations section of the MongoDB manual). By default, WiredTiger uses the larger of 256 MB or 50% of (RAM − 1 GB) for its internal cache, so on a machine that also runs your parser, a heavy insert workload can push the server toward the memory limit.
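The cache cap can be set in `mongod.conf`; the 2 GB value below is illustrative and should be sized to leave headroom for your parser and the OS:

```yaml
# mongod.conf -- cap the WiredTiger internal cache (value in GB)
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 2
```

The same setting is available on the command line as `mongod --wiredTigerCacheSizeGB 2`.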
Beyond that, you can run MongoDB in containers (Kubernetes, Docker, rkt), where resource usage is easy to control. If you run MongoDB under systemd, you can also set resource limits there.
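Under systemd, a drop-in override for the service unit can enforce limits; the path and values below are illustrative, and note that setting `MemoryMax` below MongoDB's working set will cause the kernel to kill `mongod` rather than merely slow it down, so pair it with a matching WiredTiger cache cap:

```ini
# /etc/systemd/system/mongod.service.d/override.conf (drop-in file)
[Service]
# Hard memory ceiling for the mongod cgroup (cgroup v2; use MemoryLimit= on v1)
MemoryMax=4G
# Optional: cap CPU usage to two cores' worth
CPUQuota=200%
```

Apply it with `systemctl daemon-reload && systemctl restart mongod`.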