MongoDB – Inserting Records Using Pymongo Exhausts RAM

mongodb, python

I am writing Python code that parses a huge (17 GB) XML document (the Wikipedia collection) and inserts each page title, along with the links present on that page (stored as a list), into a MongoDB collection. I insert using the command links.insert_one(data), where data is a dictionary of the following form:

data = {
    'document': self.title,  # a string (the title of the Wikipedia page)
    'links': self.links      # a list of strings (all hyperlinked words on the page)
}

links.insert_one(data)

I repeat this process for every page (around 19 million pages in total).
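For reference, a minimal sketch of the loop described above (the parse_pages generator and the database/collection names here are placeholders, not my actual code):

```python
def page_doc(title, links):
    """Build the document inserted for each page (shape as shown above)."""
    return {"document": title, "links": list(links)}

# The actual insertion loop requires a running mongod, so it is shown
# commented out; parse_pages is a hypothetical generator over the XML dump:
#
# from pymongo import MongoClient
# coll = MongoClient("mongodb://localhost:27017")["wiki"]["links"]
# for title, links in parse_pages("enwiki.xml"):
#     coll.insert_one(page_doc(title, links))
```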

When I monitor my RAM usage (using htop on Ubuntu), I see that the amount of RAM used by MongoDB keeps increasing until all of my memory is consumed, eventually crashing the system.

I have also made the following observation: when I comment out the insertion statement and run the code, it works fine. So I presume the problem lies in my MongoDB insertion process and not in the Python logic itself.

Any help would be really appreciated.

Best Answer

You can limit the WiredTiger cache size; see the hardware considerations section of the MongoDB documentation. By default, WiredTiger uses the larger of 50% of (RAM − 1 GB) or 256 MB, which on a dedicated machine can look like MongoDB eating all available memory.
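For example, the cache can be capped in the mongod configuration file (the 2 GB value here is just an illustration; pick a size that fits your machine):

```yaml
# /etc/mongod.conf
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 2
```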

Other than this, you can run MongoDB in containers (Kubernetes, Docker, rkt), where you can easily cap resource usage. If you run MongoDB under systemd, you can also set resource limits there.
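As a sketch of both approaches (the 4 GB limit and the image/unit names are examples, not recommendations):

```shell
# Docker: hard memory cap on the container
docker run -d --name mongodb --memory=4g mongo

# systemd: add a drop-in for the mongod unit (sudo systemctl edit mongod):
#   [Service]
#   MemoryMax=4G
```

Note that a hard cap without also lowering the WiredTiger cache size can lead to the process being OOM-killed instead, so the two settings are best used together.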