GNU Parallel Limit Memory Usage

gnu-parallel, memory, nice

Is it possible to limit the memory usage of all processes started by GNU parallel? I realize there are ways to limit the number of jobs, but in cases where it isn't easy to predict the memory usage ahead of time it can be difficult to tune this parameter.

In my particular case I'm running programs on an HPC cluster where there are hard limits on process memory. E.g. if there's 72 GB of RAM available on a node, the batch system will kill jobs that exceed 70 GB. I'm also unable to spawn jobs directly into swap and hold them there.

The GNU parallel package comes with niceload, which seems to allow the current memory usage to be checked before a process runs, but I'm not sure how to use it.

Best Answer

The short answer is:

ulimit -m 1000000
ulimit -v 1000000

which will limit each process to roughly 1 GB of RAM (ulimit takes its values in kilobytes).
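Because GNU parallel runs each job in its own shell when the command contains shell syntax, you can set the limit inside the command so that every job gets its own cap. As a sketch (myprog and the input arguments are placeholders, and the 1000000 KB value is the same ~1 GB example as above):

parallel 'ulimit -v 1000000; myprog {}' ::: input1 input2 input3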

Limiting the memory the "right" way is in practice extremely complicated. Let us say you have 1 GB of RAM, you start a new process every 10 seconds, and each process uses 1 MB more every second. After 140 seconds you will have something like this:

10██▎                                                          
20██████▍                                                      
30██████████▌                                                  
40██████████████▋                                              
50██████████████████▊                                          
60██████████████████████▉                                      
70███████████████████████████                                  
80███████████████████████████████▏                             
90███████████████████████████████████▎                         
100██████████████████████████████████████▍                     
110██████████████████████████████████████████▌                 
120██████████████████████████████████████████████▋             
130██████████████████████████████████████████████████▊         
140██████████████████████████████████████████████████████▉     

This sums up to 1050 MB of RAM, so now you need to kill something. Which is the right job to kill? Is it 140 (assuming it has run amok)? Is it 10 (because it has run for the least amount of time)?

In my experience, jobs where memory is an issue are typically either very predictable (e.g. transforming a bitmap) or very unpredictable. For the very predictable ones you can do the computation beforehand and see how many jobs can be run at once.
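For the predictable case, a minimal sketch (the ~5 GB per job, the 70 GB node limit, myprog, and *.bmp are all made-up placeholders): with roughly 70 GB usable and about 5 GB per job, you can run at most 14 jobs at a time and cap each one as above:

# ~70 GB usable / ~5 GB per job => at most 14 concurrent jobs
parallel -j 14 'ulimit -v 5000000; myprog {}' ::: *.bmp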

For the unpredictable ones, you ideally want the system to start a few jobs that take up a lot of memory, and when those are done, to start more jobs that take up less memory. But you do not know beforehand which jobs will take a lot, which will take a little, and which will run amok. Some jobs' normal life cycle is to run with little memory for a long time and then balloon to a much bigger size later on; it is very hard to tell those apart from jobs that have run amok.

When someone points me to a well-thought-out way to do this that makes sense for many applications, GNU Parallel will probably be extended with it.
