Linux – Is it wrong to think of “memfd”s as accounted “to the process that owns the file”?

linux-kernel, out of memory, resources, shared memory

https://dvdhrm.wordpress.com/2014/06/10/memfd_create2/

Theoretically, you could achieve [memfd_create()] behavior without introducing new syscalls, like this:

int fd = open("/tmp", O_RDWR | O_TMPFILE | O_EXCL, S_IRWXU);

(Note, to more portably guarantee a tmpfs here, we can use "/dev/shm" instead of "/tmp").

Therefore, the most important question is why the hell do we need a third way?

[…]

  • The backing-memory is accounted to the process that owns the file and is not subject to mount-quotas.

^ Am I right in thinking the first part of this sentence cannot be relied on?

The memfd_create() code is literally implemented as an "unlinked file living in [a] tmpfs which must be kernel internal". Tracing the code, I understand that it differs in not implementing LSM checks, and that memfds are created to support "seals", as the blog post goes on to explain. However, I'm extremely sceptical that memfds are accounted differently to a tmpfile in principle.
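The "unlinked file living in tmpfs" description can be checked from user space by asking which filesystem a memfd reports. This is a minimal sketch, assuming a glibc that exposes memfd_create() (2.27 or later) and a memfd created without MFD_HUGETLB; the helper names fd_is_tmpfs() and memfd_is_tmpfs() are made up for this illustration.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>     /* memfd_create(), MFD_CLOEXEC (glibc >= 2.27 assumed) */
#include <sys/vfs.h>      /* fstatfs(), struct statfs */
#include <linux/magic.h>  /* TMPFS_MAGIC */
#include <unistd.h>

/* Hypothetical helper: 1 if fd lives on a tmpfs, 0 if not, -1 on error. */
int fd_is_tmpfs(int fd)
{
    struct statfs sfs;

    if (fstatfs(fd, &sfs) < 0)
        return -1;
    return sfs.f_type == TMPFS_MAGIC;
}

/* Hypothetical helper: create a throwaway memfd and report its filesystem. */
int memfd_is_tmpfs(void)
{
    int fd = memfd_create("fs-type-demo", MFD_CLOEXEC);
    int result;

    if (fd < 0)
        return -1;
    result = fd_is_tmpfs(fd);
    close(fd);
    return result;
}

#ifdef DEMO_MAIN  /* build with: cc -DDEMO_MAIN demo.c */
int main(void)
{
    printf("memfd on tmpfs: %d\n", memfd_is_tmpfs());
    return 0;
}
#endif
```

The same fd_is_tmpfs() check can be applied to a descriptor opened with O_TMPFILE under "/dev/shm" to compare the two approaches.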

Specifically, when the OOM-killer comes knocking, I don't think it will account for memory held by memfds. This could total up to 50% of RAM – the value of the size= option for tmpfs. The kernel doesn't set a different value for the internal tmpfs, so it would use the default size of 50%.

So I think we can generally expect that processes which hold a large memfd, but have no other significant memory allocations, will not be OOM-killed. Is that correct?

Best Answer

Building on @danblack's answer:

The decision is based on oom_kill_process() (cleaned up a bit):

for_each_thread(p, t) {
        list_for_each_entry(child, &t->children, sibling) {
                unsigned int child_points;

                child_points = oom_badness(child,
                        oc->memcg, oc->nodemask, oc->totalpages);
                if (child_points > victim_points) {
                        put_task_struct(victim);
                        victim = child;
                        victim_points = child_points;
                        get_task_struct(victim);
                }
        }
}

(https://github.com/torvalds/linux/blob/master/mm/oom_kill.c#L974)

This depends on oom_badness() to find the best candidate:

child_points = oom_badness(child,
        oc->memcg, oc->nodemask, oc->totalpages);

oom_badness() does:

points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
        mm_pgtables_bytes(p->mm) / PAGE_SIZE;

(https://github.com/torvalds/linux/blob/master/mm/oom_kill.c#L233)
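The same sum can be approximated from user space for the current process. This is only a sketch: it assumes that the VmRSS, VmSwap and VmPTE fields of /proc/self/status correspond to the three terms above, and it ignores the oom_score_adj adjustment that the kernel applies afterwards. The helper names status_field_kb() and approx_badness_pages() are made up for this illustration.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: value in kB of a "Name:  N kB" line in
 * /proc/self/status, or -1 if the field is missing or unreadable. */
long status_field_kb(const char *field)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    size_t n = strlen(field);
    long kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, field, n) == 0 && line[n] == ':') {
            if (sscanf(line + n + 1, "%ld", &kb) != 1)
                kb = -1;
            break;
        }
    }
    fclose(f);
    return kb;
}

/* Rough user-space analogue of the kernel expression, in pages:
 * rss + swap entries + page-table pages. Returns -1 on error. */
long approx_badness_pages(void)
{
    long rss  = status_field_kb("VmRSS");   /* file + anon + shmem pages */
    long swap = status_field_kb("VmSwap");  /* swapped-out pages */
    long pte  = status_field_kb("VmPTE");   /* page-table memory */
    long page_kb = sysconf(_SC_PAGESIZE) / 1024;

    if (rss < 0 || swap < 0 || pte < 0 || page_kb <= 0)
        return -1;
    return (rss + swap + pte) / page_kb;
}

#ifdef DEMO_MAIN  /* build with: cc -DDEMO_MAIN demo.c */
int main(void)
{
    printf("approximate badness: %ld pages\n", approx_badness_pages());
    return 0;
}
#endif
```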

Where:

static inline unsigned long get_mm_rss(struct mm_struct *mm)
{
        return get_mm_counter(mm, MM_FILEPAGES) +
                get_mm_counter(mm, MM_ANONPAGES) +
                get_mm_counter(mm, MM_SHMEMPAGES);
}

(https://github.com/torvalds/linux/blob/master/mm/oom_kill.c#L966)

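So it looks like it counts the pages a process has mapped, including shmem pages (MM_SHMEMPAGES). That is the category memfd_create() pages fall into, since a memfd is backed by tmpfs rather than by anonymous memory. Note that pages of a memfd which the process has not mapped are not part of its RSS.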
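One way to see how a memfd shows up in these counters is to watch the RssShmem field of /proc/self/status while filling a memfd. This is a minimal sketch, assuming a kernel recent enough to report RssShmem and a glibc that exposes memfd_create(). The helpers rss_shmem_kb() and memfd_rss_delta_kb() are made up for this illustration, and the exact numbers will vary because the kernel's counters are only approximately up to date.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>   /* memfd_create(), mmap() */
#include <unistd.h>

/* Hypothetical helper: RssShmem of this process in kB, or -1 on failure. */
long rss_shmem_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "RssShmem: %ld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;
}

/* Hypothetical helper: put `size` bytes into a fresh memfd and return how
 * much RssShmem grew (kB). use_mmap != 0 fills through a shared mapping,
 * otherwise the data is written with write(). Returns -1 on error. */
long memfd_rss_delta_kb(size_t size, int use_mmap)
{
    static char buf[1 << 16];
    long before, after;
    void *p = MAP_FAILED;
    int fd = memfd_create("rss-demo", MFD_CLOEXEC);

    if (fd < 0)
        return -1;
    before = rss_shmem_kb();
    if (use_mmap) {
        if (ftruncate(fd, (off_t)size) < 0)
            goto fail;
        p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            goto fail;
        memset(p, 0xab, size);              /* fault every page in */
    } else {
        memset(buf, 0xab, sizeof(buf));
        for (size_t done = 0; done < size; ) {
            size_t want = size - done < sizeof(buf) ? size - done : sizeof(buf);
            ssize_t n = write(fd, buf, want);

            if (n <= 0)
                goto fail;
            done += (size_t)n;
        }
    }
    after = rss_shmem_kb();                 /* sample while pages are held */
    if (p != MAP_FAILED)
        munmap(p, size);
    close(fd);
    return (before < 0 || after < 0) ? -1 : after - before;
fail:
    close(fd);
    return -1;
}

#ifdef DEMO_MAIN  /* build with: cc -DDEMO_MAIN demo.c */
int main(void)
{
    printf("filled via mmap():  %+ld kB\n", memfd_rss_delta_kb((size_t)16 << 20, 1));
    printf("filled via write(): %+ld kB\n", memfd_rss_delta_kb((size_t)16 << 20, 0));
    return 0;
}
#endif
```

In the write() case the pages still exist in the kernel-internal tmpfs, but they are not mapped into the process, so they should not appear in its RSS. That is the scenario the question is worried about.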
