Bash – How to terminate xvfb-run properly

bash, xvfb

In order to perform some JavaScript unit tests with karma inside a docker container (based on ubuntu 14.04) I'm starting firefox in the container using a karma-script-launcher with xvfb-run. The start script looks like this:

#!/bin/bash
set -o errexit 

# nasty workaround as xvfb-run doesn't cleanup properly...
trap "pkill -f /usr/lib/firefox/firefox" EXIT

xvfb-run --auto-servernum --server-args='-screen 0, 1024x768x16' firefox $1

Starting the browser and executing the unit tests works very well. After executing the tests karma terminates the spawned browser instance – in my case the script that launched firefox over xvfb-run.

In the above script you can see that I registered a trap to kill the launched firefox when my script exits. This works, but the script is not a good citizen: it terminates every firefox instance that is currently running rather than only the one it launched. I first tried to kill the xvfb-run process, but killing it has no effect on the sub-process launched by the xvfb-run script…

If I start firefox over xvfb-run manually there is a bunch of spawned processes:

root@1d7a5988e521:/data# xvfb-run --auto-servernum --server-args='-screen 0, 1024x768x16' firefox &
[1] 348
root@1d7a5988e521:/data# ps ax
  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     0:00 bash
  348 ?        S      0:00 /bin/sh /usr/bin/xvfb-run --auto-servernum --server-args=-screen 0, 1024x768x16 firefox
  360 ?        S      0:00 Xvfb :99 -screen 0, 1024x768x16 -nolisten tcp -auth /tmp/xvfb-run.bgMEuq/Xauthority
  361 ?        Sl     0:00 /usr/lib/firefox/firefox
  378 ?        S      0:00 dbus-launch --autolaunch bcf665e095759bae9fc1929b57455cad --binary-syntax --close-stderr
  379 ?        Ss     0:00 //bin/dbus-daemon --fork --print-pid 5 --print-address 7 --session
  388 ?        S      0:00 /usr/lib/x86_64-linux-gnu/gconf/gconfd-2
  414 ?        R+     0:00 ps ax
root@1d7a5988e521:/data#

If I now kill the xvfb-run process (PID 348), only this process will be terminated, leaving the other processes running. If I kill the firefox process (PID 361) instead, the xvfb-run script correctly terminates and kills the other processes as well. But from my script I only know the PID of the xvfb-run process…
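
One way to act on that observation is to look up the children of the known PID and signal only those. This is a sketch under the assumption that the browser is a direct child of the xvfb-run script (the ps listing above does not show parent PIDs, so this is not confirmed by it). The helper name children_of is hypothetical, and sh and sleep stand in for xvfb-run and firefox so the pattern can be tried without an X server.

```shell
#!/bin/sh
# Hypothetical helper: print the PIDs whose parent PID is $1.
children_of() {
    ps -A -o pid= -o ppid= | awk -v parent="$1" '$2 == parent { print $1 }'
}

# Stand-in for "xvfb-run ... firefox &": a wrapper shell waiting on a child.
sh -c 'sleep 30 & wait' &
wrapper_pid=$!
sleep 1   # give the wrapper a moment to start its child

# Signal only the wrapper's children, not every process with a matching name.
for child in $(children_of "$wrapper_pid"); do
    kill -TERM "$child"
done

# The wrapper returns from its own wait once the child is gone.
wait "$wrapper_pid"
wrapper_status=$?
echo "wrapper exited with status $wrapper_status"
```

Compared with pkill -f, this touches only processes started by the one wrapper the script knows about.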

During my research I stumbled across this rather old bug report for xvfb-run, which still seems to be valid even though the bug's status has been "fixed" since 2012.

Is there any polite way to terminate the xvfb-run process in order for the other processes to be cleaned up correctly?

I already asked this on Stack Overflow but have not received an answer so far. Perhaps it is somewhat off-topic for Stack Overflow and better placed here?

Best Answer

It sounds like you are only using xvfb-run for its --auto-servernum functionality.

As @meuh pointed out, that logic is actually pretty simple:

# Copyright (C) 2005 The T2 SDE Project
# Copyright (C) XXXX - 2005 Debian
# GNU GPLv2
find_free_servernum() {
    # Sadly, the "local" keyword is not POSIX.  Leave the next line commented in
    # the hope Debian Policy eventually changes to allow it in /bin/sh scripts
    # anyway.
    #local i

    i=$SERVERNUM
    while [ -f /tmp/.X$i-lock ]; do
        i=$(($i + 1))
    done
    echo $i
}
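
As a usage note, the function takes its starting point from the SERVERNUM variable, which xvfb-run initialises to 99, so a standalone script has to set it first. The sketch below is a variant for experimentation only. The name find_free_servernum_in and its two arguments are hypothetical additions, so the probing loop can be tried against a scratch directory instead of /tmp.

```shell
#!/bin/sh
# Hypothetical variant of find_free_servernum: same loop, but the lock
# directory is passed as $1 and the starting number as $2.
find_free_servernum_in() {
    lockdir=$1
    i=$2
    while [ -f "$lockdir/.X$i-lock" ]; do
        i=$((i + 1))
    done
    echo "$i"
}

# Simulate displays :99 and :100 being taken in a scratch directory.
scratch=$(mktemp -d)
touch "$scratch/.X99-lock" "$scratch/.X100-lock"

free=$(find_free_servernum_in "$scratch" 99)
echo "first free display: :$free"   # prints "first free display: :101"
rm -rf "$scratch"
```

The original find_free_servernum shown above remains the function used in the rest of this answer.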

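With that function defined, you could try an invocation like this instead of using xvfb-run. Note that Xvfb, unlike xvfb-run, does not launch a client program itself, so firefox has to be started separately with DISPLAY pointing at the new server (a short delay may be needed so that the server is ready before the client connects):

SERVERNUM=99    # starting point for the search (xvfb-run's default)
DISPLAY_NUM=$(find_free_servernum)
Xvfb ":$DISPLAY_NUM" -screen 0, 1024x768x16 &
THE_PID=$!
DISPLAY=":$DISPLAY_NUM" firefox "$1" &
FIREFOX_PID=$!
# kill firefox and Xvfb whenever you feel like it
kill -15 "$FIREFOX_PID" "$THE_PID"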

With xvfb-run removed, we no longer need to worry about how to kill it.
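
The same idea extends to the trap from the question: remember the PIDs the script started and signal only those on exit, rather than matching by name with pkill -f. This is a minimal sketch, not the answerer's code. start_bg and cleanup are hypothetical helper names, and sleep stands in for Xvfb and firefox so the pattern can be tried anywhere.

```shell
#!/bin/sh
started_pids=""

# Start a command in the background and remember its PID.
start_bg() {
    "$@" &
    started_pids="$started_pids $!"
}

# Terminate only what this script started, then reap it.
cleanup() {
    for pid in $started_pids; do
        kill -TERM "$pid" 2>/dev/null || true
    done
    for pid in $started_pids; do
        wait "$pid" 2>/dev/null || true
    done
    started_pids=""
}
trap cleanup EXIT

start_bg sleep 300   # stand-in for the Xvfb server
start_bg sleep 300   # stand-in for the browser
tracked=$started_pids

cleanup              # normally triggered by the EXIT trap
```

Because cleanup empties the list after running, the EXIT trap firing later is harmless.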

Related Solutions

Bash – Why can’t I kill a timeout called from a Bash script with a keystroke

Signal keys such as Ctrl+C send a signal to all processes in the foreground process group.

In the typical case, a process group is a pipeline. For example, in head <somefile | sort, the process running head and the process running sort are in the same process group, as is the shell, so they all receive the signal. When you run a job in the background (somecommand &), that job is in its own process group, so pressing Ctrl+C doesn't affect it.

The timeout program places itself in its own process group. From the source code:

/* Ensure we're in our own group so all subprocesses can be killed.
   Note we don't just put the child in a separate group as
   then we would need to worry about foreground and background groups
   and propagating signals between them.  */
setpgid (0, 0);

When a timeout occurs, timeout goes through the simple expedient of killing the process group of which it is a member. Since it has put itself in a separate process group, its parent process will not be in the group. Using a process group here ensures that if the child application forks into several processes, all its processes will receive the signal.

When you run timeout directly on the command line and press Ctrl+C, the resulting SIGINT is received both by timeout and by the child process, but not by the interactive shell, which is timeout's parent process. When timeout is called from a script, only the shell running the script receives the signal: timeout doesn't get it since it's in a different process group.
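
The effect of that setpgid call can be observed from a script. The sketch below assumes GNU coreutils timeout and a ps that accepts -o pgid= and -p (as the procps version does). It compares the process group of a backgrounded timeout with that of the script itself, then signals the whole group through a negative PID.

```shell
#!/bin/sh
timeout 30 sleep 30 &
pid=$!
sleep 1   # let timeout start and call setpgid

# Strip the padding ps adds around the numbers.
timeout_pgid=$(ps -o pgid= -p "$pid" | tr -d ' ')
script_pgid=$(ps -o pgid= -p "$$" | tr -d ' ')
echo "timeout: pid=$pid pgid=$timeout_pgid, script: pgid=$script_pgid"

# A negative PID addresses the whole group: timeout and its child.
kill -TERM "-$timeout_pgid"
wait "$pid" || true
```

If timeout leads its own group, its PGID equals its PID and differs from the script's PGID, which is why a terminal Ctrl+C aimed at the script's group does not reach it.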

You can set a signal handler in a shell script with the trap builtin. Unfortunately, it's not that simple. Consider this:

#!/bin/sh
trap 'echo Interrupted at $(date)' INT
date
timeout 5 sleep 10
date

If you press Ctrl+C after 2 seconds, this still waits the full 5 seconds, then prints the “Interrupted” message. That's because the shell refrains from running the trap code while a foreground job is active.

To remedy this, run the job in the background. In the signal handler, call kill to relay the signal to the timeout process group.

#!/bin/sh
trap 'kill -INT -$pid' INT
timeout 5 sleep 10 &
pid=$!
wait $pid

Bash – Script dies when parent process is terminated

On a CentOS 7 test system, the .NET SDK was installed via

$ sudo rpm -Uvh https://packages.microsoft.com/config/rhel/7/packages-microsoft-prod.rpm
$ sudo yum install dotnet-sdk-2.1

which results in dotnet-sdk-2.1-2.1.400-1.x86_64 being installed. Then, with the test code

using System;
using System.Diagnostics;
using System.ComponentModel;
namespace myApp {
    class Program {
        static void Main(string[] args) {
            var process = new Process();
            process.EnableRaisingEvents = true; // to avoid [defunct] sh processes
            process.StartInfo.FileName = "/var/tmp/foo";
            process.StartInfo.Arguments = "";
            process.StartInfo.UseShellExecute = true;
            process.StartInfo.CreateNoWindow = true;
            process.Start();
            process.WaitForExit(10000);
            if (process.HasExited) {
                Console.WriteLine("Exit code: " + process.ExitCode);
            } else {
                Console.WriteLine("Child process still running after 10 seconds");
            }
        }
    }
}

and a shell script as /var/tmp/foo, a run under strace stalls and shows that /var/tmp/foo is run through xdg-open, which on my system does... I'm not sure what; it seems a needless complication.

$ strace -o foo -f dotnet run
Child process still running after 10 seconds
^C
$ grep /var/tmp/foo foo
25907 execve("/usr/bin/xdg-open", ["/usr/bin/xdg-open", "/var/tmp/foo"], [/* 37 vars */] <unfinished ...>
...

A simpler solution is to exec a program directly (which in turn can be a shell script that does what you want); for .NET this requires not using the shell:

            process.StartInfo.UseShellExecute = false;

With this set, strace shows that /var/tmp/foo is being run via a (much simpler) execve(2) call:

26268 stat("/var/tmp/foo", {st_mode=S_IFREG|0755, st_size=37, ...}) = 0
26268 access("/var/tmp/foo", X_OK)      = 0
26275 execve("/var/tmp/foo", ["/var/tmp/foo"], [/* 37 vars */] <unfinished ...>

and that .NET refuses to exit:

$ strace -o foo -f dotnet run
Child process still running after 10 seconds
^C^C^C^C^C^C^C^C

because foo replaces itself with something that ignores most signals (notably not USR2, or there is always KILL (but avoid using that!)):

$ cat /var/tmp/foo
#!/bin/sh
exec /var/tmp/stayin-alive
$ cat /var/tmp/stayin-alive
#!/usr/bin/perl
use Sys::Syslog;
for my $s (qw(HUP INT QUIT PIPE ALRM TERM CHLD USR1)) {
   $SIG{$s} = \&shandle;
}
openlog( 'stayin-alive', 'ndelay,pid', LOG_USER );
while (1) {
    syslog LOG_NOTICE, "oh oh oh oh oh stayin alive";
    sleep 7;
}
sub shandle {
    syslog LOG_NOTICE, "nice try - @_";
}

daemonize

With a process that disassociates itself from the parent and a shell script that runs a few commands (hopefully equivalent to your intended apt-get update; apt-get upgrade)

$ cat /var/tmp/a-few-things
#!/bin/sh
sleep 17 ; echo a >/var/tmp/output ; echo b >>/var/tmp/output

we can modify the .NET program to run /var/tmp/solitary /var/tmp/a-few-things

            process.StartInfo.FileName = "/var/tmp/solitary";
            process.StartInfo.Arguments = "/var/tmp/a-few-things";
            process.StartInfo.UseShellExecute = false;

which when run causes the .NET program to exit fairly quickly

$ dotnet run
Exit code: 0

and, eventually, the /var/tmp/output file does contain two lines written by a process that was not killed when the .NET program went away.

You should probably save the output from the APT commands somewhere, and you may also need something to prevent two (or more!) updates from running at the same time. This version does not stop for questions and ignores any TERM signals (INT may also need to be ignored).

#!/bin/sh
trap '' TERM
set -e
apt-get --yes update
apt-get --yes upgrade
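
For the concern about two updates running at once, one possible guard is a lock directory, since mkdir either creates the directory or fails. This is only a sketch: the lock path and the run_update function are hypothetical placeholders, and the real apt-get calls are replaced by an echo so the example is harmless to run.

```shell
#!/bin/sh
# Hypothetical lock location. The PID suffix only keeps this demo from
# colliding with other runs; a real lock would use one fixed path.
lockdir="${TMPDIR:-/tmp}/update-demo.lock.$$"

run_update() {
    # mkdir fails if the directory already exists, so only one caller wins.
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "skipped: another update is running"
        return 1
    fi
    echo "updating"   # placeholder for the apt-get commands
    return 0
}

first=$(run_update)            # acquires the lock
second=$(run_update) || true   # lock still held, so this call is refused
rmdir "$lockdir"               # release the lock when done
echo "$first / $second"
```

In a real script the rmdir would typically go into an EXIT trap so the lock is released even when a command fails under set -e.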
