I am writing Python code that parses a huge (17 GB) XML document (a Wikipedia dump) and inserts each page title, together with the list of links present on that page, into a MongoDB collection. I insert with `links.insert_one(data)`, where `data` is a dictionary with the following structure:
```python
data = {
    'document': self.title,  # string: the title of the Wikipedia page
    'links': self.links      # list of strings: all hyperlinked words on the page
}
<db_name>.insert_one(data)
```
I repeat this process for every page (around 19 million pages).
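The per-page insertion described above can be sketched as follows. The `pages` generator, the document contents, and the batch size are illustrative, and the actual `insert_many` call is left as a comment so the sketch runs without a MongoDB server; batching inserts does not change the server's cache behavior, but it does cut down on client round trips:

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of up to `size` documents, suitable for a bulk insert."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

# Illustrative stand-in for the parsed pages (title plus list of links).
pages = (
    {'document': f'Page {i}', 'links': [f'Link {i}']}
    for i in range(5)
)

for batch in batched(pages, 2):
    # With pymongo this would be: links.insert_many(batch, ordered=False)
    print([d['document'] for d in batch])
```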
When I monitor RAM usage (using htop on Ubuntu), I see that the memory used by MongoDB keeps increasing until all of it is consumed, eventually crashing the system.
I have also observed that if I comment out the insertion statement, the code runs fine. So I presume the problem lies in my MongoDB insertion process rather than in the Python logic itself.
Any help would be really appreciated.
Best Answer
You can cap the WiredTiger cache size (see the hardware considerations section of the MongoDB manual). By default, WiredTiger uses the larger of 256 MB or 50% of (RAM − 1 GB) for its internal cache, so on a machine that also runs your parser, a heavy insert workload can push the server toward the memory limit.
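The cache cap can be set in `mongod.conf`; the 2 GB value below is illustrative and should be sized to leave headroom for your parser and the OS:

```yaml
# mongod.conf -- cap the WiredTiger internal cache (value in GB)
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 2
```

The same setting is available on the command line as `mongod --wiredTigerCacheSizeGB 2`.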
Beyond that, you can run MongoDB in containers (Kubernetes, Docker, rkt), where resource usage is easy to control. If you run MongoDB under systemd, you can also set resource limits there.
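Under systemd, a drop-in override for the service unit can enforce limits; the path and values below are illustrative, and note that setting `MemoryMax` below MongoDB's working set will cause the kernel to kill `mongod` rather than merely slow it down, so pair it with a matching WiredTiger cache cap:

```ini
# /etc/systemd/system/mongod.service.d/override.conf (drop-in file)
[Service]
# Hard memory ceiling for the mongod cgroup (cgroup v2; use MemoryLimit= on v1)
MemoryMax=4G
# Optional: cap CPU usage to two cores' worth
CPUQuota=200%
```

Apply it with `systemctl daemon-reload && systemctl restart mongod`.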