System Calls – Rationale Behind EINTR

posix process signals system-calls

Small talk as background

EINTR is the error that so-called interruptible system calls may return. If a signal arrives while such a system call is in progress, the signal is not ignored, a handler was installed for it without SA_RESTART, and that handler runs and returns, then the system call fails with the EINTR error code.

As a side note, I got this error very often using ncurses in Python.

The question

Is there a rationale behind this behaviour specified by the POSIX standard? One can understand that it may not be possible to resume the call (depending on the kernel design), but what is the rationale for not restarting it automatically at the kernel level? Is this for legacy or technical reasons? If it is for technical reasons, are these reasons still valid nowadays? If it is for legacy reasons, then what is the history?

Best Answer

It is difficult to do nontrivial things in a signal handler, since the rest of the program is in an unknown state. Most signal handlers just set a flag, which is later checked and handled elsewhere in the program.

Reason for not restarting the system call automatically:

Imagine an application that receives data from a socket with the blocking recv() system call. Data arrives very slowly, so the program spends a long time inside that call. The program has a signal handler for SIGINT that sets a flag (which is evaluated elsewhere), and SA_RESTART is set so that the system call restarts automatically, which makes recv() effectively uninterruptible. Suppose the program is blocked in recv(), waiting for data that does not arrive, and the user presses Ctrl+C. The system call is interrupted and the signal handler, which just sets the flag, is executed. Then recv() is restarted and goes back to waiting for data. The event loop is stuck in recv() and has no opportunity to evaluate the flag and exit the program gracefully.

With SA_RESTART not set:

In the above scenario, when SA_RESTART is not set, recv() would fail with EINTR instead of being restarted. The system call returns, and thus the program can continue. Of course, the program should then check the flag set by the signal handler as early as possible and do its cleanup or whatever else it needs to do.

The OS implementation view

Consider what happens if a system call is interrupted by a signal. The signal handler will execute user-mode code. But the syscall handler is kernel code and does not trust any user-mode code. So let's explore the choices for the syscall handler:

1. Terminate the system call; report how much was done to the user code. It's up to the application code to restart the system call in some way, if desired. That's how unix works.
2. Save the state of the system call, and allow the user code to resume the call. This is problematic for several reasons:
- While the user code is running, something could happen to invalidate the saved state. For example, if reading from a file, the file might be truncated. So the kernel code would need a lot of logic to handle these cases.
- The saved state can't be allowed to keep any lock, because there's no guarantee that the user code will ever resume the syscall, and then the lock would be held forever.
- The kernel must expose new interfaces to resume or cancel ongoing syscalls, in addition to the normal interface to start a syscall. This is a lot of complication for a rare case.
- The saved state would need to use resources (memory, at least); those resources would need to be allocated and held by the kernel but be counted against the process's allotment. This isn't insurmountable, but it is a complication.
  - Note that the signal handler might make system calls that themselves get interrupted; so you can't just have a static resource allotment that covers all possible syscalls.
  - And what if the resources cannot be allocated? Then the syscall would have to fail anyway. Which means the application would need to have code to handle this case, so this design would not simplify the application code.
3. Keep the system call in progress (but suspended) and create a new thread for the signal handler. This, again, is problematic:
- Early unix implementations had a single thread per process.
- The signal handler would risk stepping on the syscall's toes. This is an issue anyway, but in the current unix design, it's contained.
- Resources would need to be allocated for the new thread; see above.
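
The first choice, where the kernel reports how much was done and the application decides whether to continue, is what the usual "write everything" loop relies on. A minimal sketch, assuming a blocking file descriptor; write_all is a hypothetical helper name, not a standard function:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: writes all len bytes to fd, continuing after short
 * writes and after EINTR. Returns 0 on success, -1 on error (errno set).
 */
int write_all(int fd, const void *data, size_t len)
{
    const char *p = data;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;      /* nothing was written: issue the call again */
            return -1;         /* a real error: let the caller decide */
        }
        p += n;                /* the kernel reported partial progress */
        len -= (size_t)n;
    }
    return 0;
}
```

All the resume logic lives in user space here: the kernel keeps no saved state between the calls, and a program that would rather stop on a signal can simply not loop.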

The main difference from an interrupt is that the interrupt code is trusted and highly constrained. It's usually not allowed to allocate resources, or run forever, or take locks and not release them, or do any other kind of nasty thing; since the interrupt handler is written by the OS implementers themselves, they know that it won't do anything bad. On the other hand, application code can do anything.

The application design view

When an application is interrupted in the middle of a system call, should the syscall continue to completion? Not always. For example, consider a program like a shell that's reading a line from the terminal, and the user presses Ctrl+C, triggering SIGINT. The read must not complete; that's what the signal is all about. Note that this example shows that the read syscall must be interruptible even if no byte has been read yet.

So there must be a way for the application to tell the kernel to cancel the system call. Under the unix design, that happens automatically: the signal makes the syscall return. Other designs would require a way for the application to resume or cancel the syscall at its leisure.

The read system call is the way it is because it's the primitive that makes sense, given the general design of the operating system. What it means is, roughly, “read as much as you can, up to a limit (the buffer size), but stop if something else happens”. Actually reading a full buffer involves running read in a loop until as many bytes as possible have been read; this is a higher-level function, fread(3). Unlike read(2), which is a system call, fread is a library function, implemented in user space on top of read. It's suitable for an application that reads from a file or dies trying; it's not suitable for a command line interpreter or for a networked program that must throttle connections cleanly, nor for a networked program that has concurrent connections and doesn't use threads.
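
The “as much as you can, up to a limit” behaviour is easy to observe on a pipe. The helper below is a sketch written for this illustration (its name and the 256-byte cap are arbitrary): it puts a known number of bytes into a fresh pipe, closes the write end, and returns whatever a single read() reports.

```c
#include <unistd.h>

/*
 * Hypothetical helper: places `available` bytes in a new pipe, then asks
 * read() for up to `wanted` bytes. Returns the result of that single
 * read() call, or -1 if the setup failed.
 */
ssize_t read_from_pipe_holding(size_t available, size_t wanted)
{
    char data[256] = {0};
    char buf[256];
    int fds[2];

    if (available > sizeof data || wanted > sizeof buf)
        return -1;
    if (pipe(fds) == -1)
        return -1;

    if (available > 0 &&
        write(fds[1], data, available) != (ssize_t)available) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    close(fds[1]);                  /* no more data will ever arrive */

    ssize_t n = read(fds[0], buf, wanted);
    close(fds[0]);
    return n;
}
```

With 5 bytes available and 64 wanted, read() returns 5 rather than waiting for more; with 64 available and 5 wanted, it returns 5 because of the limit; with nothing available and the write end closed, it returns 0 for end of file. A caller that needs a full buffer has to loop, which is what fread does on its behalf.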

An example of read in a loop is provided in Robert Love's Linux System Programming:

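ssize_t ret;
/* Keep reading until len bytes have arrived or read() reports end of file (0). */
while (len != 0 && (ret = read (fd, buf, len)) != 0) {
  if (ret == -1) {
    /* Interrupted by a signal before any data was read: reissue the call. */
    if (errno == EINTR)
      continue;
    /* Any other error ends the loop. */
    perror ("read");
    break;
  }
  /* Partial read: skip past the bytes received and ask for the rest. */
  len -= ret;
  buf += ret;
}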

It takes care of case i and case ii, and a few more.

Why is ‘init 6’ the reboot command? (historic reasons)

init 6 is the (or, a) reboot command because of the historical definitions of "runlevels", or general system states in which a host can be expected to be. These are generally defined as:

0 - Shut down / System halt
1 - Single User mode
2 - Reserved for administrative use
3 - Multi-User mode with networking and services
4 - Reserved for administrative use
5 - Multi-User mode with networking, services, and GUI login daemon
6 - Reboot

The init command tells the system to move to the specified runlevel. Because 6 is the commonly defined runlevel used to reboot the host, and init 6 (or telinit 6) is the means to go to that runlevel, init 6 is generally understood to be a reboot command.

Technically speaking, because these runlevels can be redefined by a crafty or bored system administrator, it might be more advisable to use shutdown -r as a reboot command. This is in part because some distributions (e.g. Gentoo) eschew this convention entirely, and because of the growing deprecation of the System V Init system in favor of upstart and other "PID 1" daemons.
