About mem and vmem

cluster memory virtual-memory

I am working with a cluster machine running Linux.

I have a shell script that uses mpirun to submit my jobs to the cluster machine. In that same script, I can choose the number of nodes that will be assigned to the job. So far, so good.

My issue arises afterwards: when I submit only a few jobs, everything works well. However, when I fill the nodes to capacity, some of the submitted jobs never complete. I therefore suspect that the memory available on the cluster is not sufficient to handle all of my jobs at the same time.

This is why I want to check the memory usage of each job over time. I use the qstat -f command for this, but it displays a lot of information, most of which I cannot understand.

So here is my question: in the sample output of the qstat -f command below, we can see two types of memory, mem and vmem. I would like to know what the difference between these two is, and what the real amount of memory used is.

resources_used.cput = 00:21:04
resources_used.mem = 2099860kb
resources_used.vmem = 40505676kb
resources_used.walltime = 00:21:08

Additionally, I would appreciate any reference where the output of this command is detailed. I tried man qstat but it doesn't go into the details of each returned line.

Best Answer

Just to remove this from the list of open questions and to give a simplified answer (goldilocks's comment above and the qstat documentation assume deeper familiarity with systems):

The answer depends on what exactly you mean by "the real amount of memory used" (and later in your reply to the comment: "the used RAM space").

"mem" is how much of the machine's RAM your job used, or more precisely the observed peak usage. This is not necessarily the true peak, because the job monitoring system on your cluster may only sample the usage every so often. Your job may also be trying to use a lot more memory than reported here without the system granting it, for example because no more memory is free or because other tasks running on the same machine compete for it. The file system cache can also compete for RAM if there is heavy file activity (I/O).

"vmem" is a quantity that is related to how modern processors manage memory. Again it is a peak value. This number can include various things processes can access such as memory-mapped files and swap space. It includes space that a process allocated but never used and can therefore be quite big. If the number is low, it allows you to conclude that memory is not the issue but if it is high, you don't know and need to investigate further. Some applications allocate lots of virtual memory even if they only need a fraction.
