How could running strace be fixing an OpenGL issue?

Tags: opengl, segmentation fault, strace

Since a recent major upgrade to my distribution (PLD Linux), I have been having trouble with a whole slew of programs. As best I can tell, anything that touches OpenGL or PulseAudio segfaults. I'm using the proprietary nvidia drivers and a 3.2.x kernel. Xorg itself runs fine and I am able to run most programs; however, things like mplayer segfault, and no sound is produced by any program.

Once I figured out that it might be related to OpenGL, I started playing with glxgears as a test. Running it by itself segfaults instantly. Then I discovered that it runs fine under strace. The same is true for mplayer: running it on a test mp3 file segfaults instantly, but running strace mplayer plays just fine (although PulseAudio still dies and it reverts to a dummy output device).

How could running something under strace keep it from segfaulting and how would I continue to debug the situation?

Best Answer

I have observed that Nvidia's libGL.so attempts to detect whether the current process is being traced by opening /proc/self/status and looking for "TracerPid:". Different code paths are taken depending on whether the value of TracerPid is non-zero (i.e., whether the current process is being traced).
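To look at the same field yourself, here is a minimal sketch that prints TracerPid from a status file. The function name tracer_pid is hypothetical. The sketch only reads the field described above and makes no claim about how libGL parses it.

```shell
#!/bin/sh
# tracer_pid [STATUS_FILE]
# Print the TracerPid value from a /proc/<pid>/status style file.
# With no argument it reads the status of the current shell ($$).
# A value of 0 means "not traced"; any other value is the tracer's PID.
tracer_pid() {
    awk '/^TracerPid:/ { print $2 }' "${1:-/proc/$$/status}"
}
```

Once sourced into a shell, tracer_pid with no argument should print 0 in a normal session, and the PID of strace when the shell itself was started under strace.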

Install sysdig and capture a trace of the offending process twice: once while running under strace and once without. For example:

$ sysdig -w glxgears.scap proc.name=glxgears &
$ glxgears &
$ kill -TERM `pidof glxgears`
$ kill -TERM `pidof sysdig`
$ sysdig -w glxgears-strace.scap proc.name=glxgears &
$ strace glxgears &
$ kill -TERM `pidof glxgears`
$ kill -TERM `pidof sysdig`

Compare the textual output of the two different traces to observe the change in execution flow between the straced and non-straced runs of glxgears.
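One possible way to do the comparison is sketched below. It assumes sysdig's -r option (read a capture file) and -p option (output format) with the evt.dir, evt.type and evt.args fields. The function names and the .txt file names are hypothetical. Dropping timestamps and event numbers keeps diff from flagging every line, although PIDs and addresses inside the arguments will still differ between runs.

```shell
#!/bin/sh
# dump_events CAPTURE.scap
# Print one line per event: direction, event type, arguments.
dump_events() {
    sysdig -r "$1" -p '%evt.dir %evt.type %evt.args'
}

# first_divergence A.txt B.txt [MAX_LINES]
# Show the first differences between two text dumps (default: 20 lines).
first_divergence() {
    diff "$1" "$2" | head -n "${3:-20}"
}

# Intended use with the captures made above:
#   dump_events glxgears.scap        > plain.txt
#   dump_events glxgears-strace.scap > traced.txt
#   first_divergence plain.txt traced.txt
```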

strace "fixes" your OpenGL issue because libGL behaves differently depending on whether the process is being traced/debugged.

Related Solutions

Running strace for specific period of time

With timeout in GNU coreutils, you can do:

Get the process id
Run timeout 60 strace -p PID

Here is an example.

test.sh:

#!/bin/bash

while :; do
    echo "$$"
    sleep 100
done

Run it:

$ ./test.sh
27121

Run strace with timeout:

% cuonglm at ~
% timeout 60 strace -p 27121
Process 27121 attached - interrupt to quit
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 27311
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD (Child exited) @ 0 (0) ---
wait4(-1, 0x7fff374b8598, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn(0xffffffffffffffff)        = 0
rt_sigaction(SIGINT, {0x45c4d0, [], SA_RESTORER, 0x7fcdc10e05c0}, {0x443910, [], SA_RESTORER, 0x7fcdc10e05c0}, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
write(1, "27121\n", 6)                  = 6
rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fcdc1a699d0) = 27328
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x443910, [], SA_RESTORER, 0x7fcdc10e05c0}, {0x45c4d0, [], SA_RESTORER, 0x7fcdc10e05c0}, 8) = 0
wait4(-1,

After 1 minute:

....
rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fcdc1a699d0) = 27328
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x443910, [], SA_RESTORER, 0x7fcdc10e05c0}, {0x45c4d0, [], SA_RESTORER, 0x7fcdc10e05c0}, 8) = 0
wait4(-1,  <unfinished ...>
Process 27121 detached
% cuonglm at ~

How to strace monitor itself

I will answer for Linux only.

Surprisingly, in newer kernels, the ptrace system call, which is used by strace in order to actually perform the tracing, is allowed to trace the init process. The manual page says:

   EPERM  The specified process cannot be traced.  This could  be  because
          the  tracer has insufficient privileges (the required capability
          is CAP_SYS_PTRACE); unprivileged  processes  cannot  trace  pro‐
          cesses  that  they  cannot send signals to or those running set-
          user-ID/set-group-ID programs, for  obvious  reasons.   Alterna‐
          tively,  the process may already be being traced, or (on kernels
          before 2.6.26) be init(8) (PID 1).

implying that starting in version 2.6.26, you can trace init, although of course you must still be root in order to do so. The strace binary on my system allows me to trace init, and in fact I can even use gdb to attach to init and kill it. (When I did this, the system immediately came to a halt.)

ptrace cannot be used by a process to trace itself, so if strace did not check, it would nevertheless fail at tracing itself. The following program:

#include <sys/ptrace.h>
#include <stdio.h>
#include <unistd.h>
int main(void) {
    /* Try to attach to ourselves; the kernel refuses this request. */
    if (ptrace(PTRACE_ATTACH, getpid(), 0, 0) == -1) {
        perror(NULL);
    }
}

prints Operation not permitted (i.e., the result is EPERM). The kernel performs this check in ptrace.c:

 retval = -EPERM;
 if (unlikely(task->flags & PF_KTHREAD))
         goto out;
 if (same_thread_group(task, current)) // <-- this is the one
         goto out;

Now, it is possible for two strace processes to trace each other; the kernel will not prevent this, and you can observe the result yourself. For me, the last thing that the first strace process (PID = 5882) prints is:

ptrace(PTRACE_SEIZE, 5882, 0, 0x11

whereas the second strace process (PID = 5890) prints nothing at all. ps shows both processes in the state t, which, according to the proc(5) manual page, means trace-stopped.

This occurs because a tracee stops whenever it enters or exits a system call and whenever a signal is about to be delivered to it (other than SIGKILL).

Assume process 5882 is already tracing process 5890. Then, we can deduce the following sequence of events:

Process 5890 enters the ptrace system call, attempting to trace process 5882. Process 5890 enters trace-stop.
Process 5882 receives SIGCHLD to inform it that its tracee, process 5890, has stopped. (A trace-stopped process appears as though it received the SIGTRAP signal.)
Process 5882, seeing that its tracee has made a system call, dutifully prints out the information about the syscall that process 5890 is about to make, and the arguments. This is the last output you see.
Process 5882 calls ptrace(PTRACE_SYSCALL, 5890, ...) to allow process 5890 to continue.
Process 5890 leaves trace-stop and performs its ptrace(PTRACE_SEIZE, 5882, ...). When the latter returns, process 5890 enters trace-stop.
Process 5882 is sent SIGCHLD since its tracee has just stopped again. Since it is being traced, the receipt of the signal causes it to enter trace-stop.

Now both processes are stopped. The end.

As you can see from this example, the situation of two processes tracing each other does not create any inherent logical difficulties for the kernel, which is probably why the kernel code does not contain a check to prevent this situation from happening. It just happens to not be very useful for two processes to trace each other.
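To check the end state described above without relying on ps alone, a small sketch like the following prints the State and TracerPid fields of each process. The function name trace_summary is hypothetical, and the PIDs in the usage comment are simply the ones from the example. If the deduced sequence holds, both processes should show state t and each should list the other as its tracer.

```shell
#!/bin/sh
# trace_summary STATUS_FILE...
# For each /proc/<pid>/status style file, print its path, the one-letter
# process state, and the PID of its tracer (0 if it is not traced).
trace_summary() {
    for f in "$@"; do
        awk -v src="$f" '
            /^State:/     { state = $2 }
            /^TracerPid:/ { tracer = $2 }
            END           { printf "%s state=%s tracer=%s\n", src, state, tracer }
        ' "$f"
    done
}

# Intended use with the PIDs from the example:
#   trace_summary /proc/5882/status /proc/5890/status
```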
