Ubuntu – How should I configure Ubuntu/Upstart for unusual network configuration

networking server upstart virtualbox

I recently installed Ubuntu 14.04 LTS (Trusty) on a new server box I built specifically to host some virtual machines. The box contains two NICs, and the network configuration exposes them only through virtual bridges: one to a private network, one to the public-facing Internet. One guest VM will access both bridges via taps, serving as the firewall and gateway for the host in particular and the private network in general. The other VM will simply be a separate guest server on the private network. The host will only directly participate on the private network via the corresponding private bridge.

As a result, neither eth0 nor eth1 will be "up" other than in the context of its corresponding virtual bridge. When Ubuntu boots, however, I believe upstart's failsafe is incorrectly assuming (insisting?) that at least eth0 be up independently before it will allow the system to get past the 20/40/60 second delays failsafe imposes. Yet the delays have almost no hope of being resolved until boot finishes and the guest VMs are allowed to start unfettered! See the paradox? To be honest, I'm not sure either eth0 or eth1 will ever reach the state failsafe is demanding.

At a raw, reactionary level, the frustrated, non-Ubuntu side of me wants to rip out failsafe, because each reboot for a configuration change is forcing me to wait up to two minutes for a status change that I'm 99.9% sure will never happen by design. Bottom line: no failsafe dependency. I'd just like the extra hoops failsafe is forcing on me to go away.

By the same token, I'm trying to be at least somewhat open-minded about what Upstart is trying to do with failsafe, as this is my first exposure to it. I've seen some (very vague) info that one approach to this involves changing the way /etc/network/interfaces is set up, moving my bridge setups into their own Upstart tasks, but I would really prefer to leave my interface definitions alone, happy, and working.

So, what are my choices? Can I just eliminate the failsafe tasking, or modify it to change its conditions? If so, how? Must I hack up my interfaces file?

Best Answer

First, let me apologize for answering my own question.

Second, I have, in fact, conquered the failsafe.conf startup delay problem. While I realize there's been no torrent of activity on this question, I've seen enough activity on various other threads about similar failsafe/boot delay problems that I'm posting my research and solution for the benefit of others in a similar pickle.

Overview

As noted in the initial post, the problem as I saw it was one where the failsafe upstart job was imposing an unwanted constraint on the booting of my system. I then researched the issue further and found out why failsafe was behaving as it was.

Analysis

By default, failsafe.conf defines a start condition that effectively fires it at boot time (as soon as the filesystem and the loopback interface are available), and defines two possible stop conditions, either of which ends it:

start on filesystem and net-device-up IFACE=lo
stop on static-network-up or starting rc-sysinit

Failsafe's insistence upon the delays arose because neither 'stop' event was firing. The second condition refers to rc-sysinit, one of the final system initialization tasks upstart runs, which has its own start condition:

start on (filesystem and static-network-up) or failsafe-boot

With failsafe not stopping, it's apparent that rc-sysinit is not starting either, at least not until failsafe emits the failsafe-boot event once its timeouts expire. Since failsafe has started, 'filesystem' has already fired, which leaves 'static-network-up' as the one unmet condition common to both jobs. In other words, failsafe keeps running because it doesn't think the network interfaces are 'up.'
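For anyone who wants that reasoning spelled out, here is a toy model in Python of the three conditions quoted above. It is only a sketch: the function names are my own invention, events are reduced to plain strings in a set, and "starting rc-sysinit" is approximated as "rc-sysinit's start condition is met." It is not how Upstart evaluates jobs internally.

```python
# Toy model of the quoted job conditions. All names here are hypothetical;
# Upstart itself tracks events very differently.

def failsafe_should_start(events):
    # failsafe.conf: start on filesystem and net-device-up IFACE=lo
    return "filesystem" in events and "net-device-up IFACE=lo" in events


def rc_sysinit_should_start(events):
    # rc-sysinit.conf: start on (filesystem and static-network-up) or failsafe-boot
    return (
        "filesystem" in events and "static-network-up" in events
    ) or "failsafe-boot" in events


def failsafe_should_stop(events):
    # failsafe.conf: stop on static-network-up or starting rc-sysinit
    return "static-network-up" in events or rc_sysinit_should_start(events)


if __name__ == "__main__":
    # Events seen early in boot on a box where static-network-up never fires.
    boot = {"filesystem", "net-device-up IFACE=lo"}
    print("failsafe starts:", failsafe_should_start(boot))
    print("failsafe stops: ", failsafe_should_stop(boot))
    # Only the timeout path (failsafe-boot) breaks the deadlock.
    print("after timeouts: ", failsafe_should_stop(boot | {"failsafe-boot"}))
```

With only 'filesystem' and the loopback event present, failsafe starts but neither stop condition can be satisfied, which is the stall I was seeing.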

The cause

Working backward, /etc/network/if-up.d contains an upstart script that iterates through every network interface declared in /etc/network/interfaces with an "auto" qualifier, meaning the interface is to be brought up at boot time. The definition of how an interface is considered 'up' becomes an important semantic issue I'll describe later.

If and only if all "auto"-configured interfaces are 'up', the upstart script will emit the famed 'static-network-up' event. That would, in turn, allow rc-sysinit to fire and terminate failsafe - hence the root cause of my problem. None of my network interfaces have an IP address at boot time - by design. But 'static-network-up' doesn't abide the idea of an interface being 'up' without an IP address, hence failsafe hangs until timeouts expire.
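The "all auto interfaces are up" rule can be sketched in a few lines of Python. This is a simplified illustration, not the actual script in /etc/network/if-up.d: the function names are hypothetical, the interfaces file is passed in as a string, and the set of 'up' interfaces is supplied by the caller rather than read from the system.

```python
# Simplified illustration of the "every 'auto' interface must be up" rule.
# Function names and the up_interfaces argument are assumptions for the demo.

def auto_interfaces(interfaces_text):
    """Return the interface names listed on 'auto' lines, skipping comments."""
    names = []
    for line in interfaces_text.splitlines():
        words = line.split("#", 1)[0].split()  # drop anything after a '#'
        if words and words[0] == "auto":
            names.extend(words[1:])  # 'auto' may list several interfaces
    return names


def static_network_up(interfaces_text, up_interfaces):
    """True when every 'auto' interface is up (trivially true if there are none)."""
    return all(name in up_interfaces for name in auto_interfaces(interfaces_text))


if __name__ == "__main__":
    sample = "auto lo\niface lo inet loopback\n\nauto br0\niface br0 inet manual\n"
    print(auto_interfaces(sample))
    # br0 is declared 'auto' but is not in the 'up' set, so the event is withheld.
    print(static_network_up(sample, {"lo"}))
```

The point of the sketch is the `all(...)` check: a single "auto" interface that never counts as 'up' is enough to hold back the event, while a file with no "auto" interfaces beyond those already up satisfies it immediately.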

For my situation, I slave the two physical NICs in the box to bridges and expose them via taps to two different VMs. One VM serves up DHCP across one tap, the other is just a server on the same network. For the bridges to function properly as tapped by the VMs, the NICs must at least be "UP", passively allowing packets through. Hence, 'auto' seemed appropriate in /etc/network/interfaces. It was not appropriate, however, in the eyes of failsafe, so the only solution had to be one that abided failsafe's semantics.

The solution to my problem, then, was twofold:

  1. Remove the 'auto' declaration from every network interface I'd defined (other than loopback).
  2. Create upstart jobs to bring up the previously "auto" interfaces "manually."

I defined one job for each of four devices - two taps and two virtual bridges - by mimicking a solution provided here.

In this configuration, with no 'auto' interfaces, the networking script should now immediately emit 'static-network-up', thus forcing failsafe to terminate. A final modification required me to add a "post-up" clause to each tap's interface definition to call 'brctl' and create the corresponding virtual bridge, previously done as part of the 'auto' configuration.

So, my /etc/network/interfaces (in part) now looks like:

#auto tpRED  (commented out)
  iface tpRED inet manual
  pre-up /usr/sbin/tunctl -t tpRED
  post-up /sbin/brctl addbr brRED

#auto brRED
  iface brRED inet manual
  bridge_ports eth1 tpRED
  bridge_hw xx:yy:aa:bb:cc:dd

The acid test

The acid test? Reboot the server. And when I did, the failsafe timeout was gone, and my network came up in a functionally identical configuration. IT WORKS!! I just wish we had a better handle on the semantics of an "UP" network interface!!