Bash – Notification Listener Behaving Strangely when Backgrounded

background-processbashshell-scriptstdinstdout

I've been struggling to get my IRC-Bouncer/Notification Script working.

This is meant to be a script which will automatically login to a remote machine, attach to a screen session (or start one if none currently exists) running weechat, while simultaneously opening another ssh connection which uses netcat to read notifications from a socket-file to which a weechat add-on-script exports my notification messages. These notifications are then fed into lib-notify (via notify-send) so that I can be alerted of activity in weechat.

Here's the script:

#!/bin/bash

BOUNCER="MYUSER@MYBOUNCER.NET"

function irc_notify() {
  ssh $BOUNCER "nc -k -l -U /tmp/weechat.notify.sock" | \
    while read type message; do
     notify-send -i weechat -u critical "$(echo -n $type | base64 -di -)" "$(echo -n $message | base64 -di -)"
    done
}

# Start listening for notifications
irc_notify &

# Attach to remote IRC Bouncer Screen
ssh $BOUNCER -t 'screen -d -R -S irc weechat'

# Cleanup Socket Listener
echo "cleaning up notification socket listener…"
ssh $BOUNCER 'pkill -f -x "nc -k -l -U /tmp/weechat.notify.sock"'

The setup actually works really well, except for one major glitch. Only two notifications were making it through to my notification manager, per invocation of the script. After that: nothing.

So to eliminate issues with the notification script in weechat, I removed the second ssh invocation (the one which attaches to the screen session and starts weechat) and replaced it with a read command to block execution while I tested. Then, using irb on the remote machine, I sent messages to the socket with ruby.
However, even when I was sending the messages manually, still only two messages would appear before it stopped working.

strace showed me some interesting behavior (when I attached to the forked process) where it seemed that the messages stopped being terminated by new-line characters after the first or second message. But after a few more, they stopped appearing in strace all-together.

At this point I decided to see if there was something in my script that was causing strange behavior. So on the command-line I simply invoked the ssh connection (ssh $BOUNCER "nc -k -l -U /tmp/weechat.notify.sock") directly. And lo-and-behold, all the messages I was sending manually were appearing (still base64 encoded, of course).

So then I added on the logic to decode every message as it came in, like I had in my script, and it also worked perfectly for every message. Including when I fed these messages into notify-send.

So at this point I decided that something strange must be happening when I forked the function. But I observed no difference in effectiveness when I back-grounded the commands in the terminal. So I wondered if perhaps something strange was happening because it was being run from within a script.

That's when things got weird…

I began by breaking the logic out of the function, and just invoking it directly, with an ampersand at the end of the piped commands. Like so:

ssh $BOUNCER "nc -k -l -U /tmp/weechat.notify.sock" | \
  while read type message; do
    notify-send -i weechat -u critical "$(echo -n $type | base64 -di -)" "$(echo -n $message | base64 -di -)"
  done &

As soon as I did that, the messages suddenly started working. And as soon as I reverted the change, I was back to square one with the same strange two-message-only behavior.

But this fix, introduced some other strange behavior. Once I was inside the screen session, I would have to hit each key multiple times before it was registered by the program. As if there was a race-condition fighting over STDIN.

Figuring perhaps the two SSH sessions were fighting over it (though I wasn't sure why) I tried to close and/or occupy STDIN on the first ssh command through various means. Such as by piping in : | before it, or appedning <&- or </dev/null after the SSH portion of the pipe. And while that did seem to resolve the race-condition, this reintroduced the two-message-only behavior.

Thinking that it might have something to do with being under multiple-layers of sub-processing, I then attempted to reproduce this by wrapping the SSH call with bash -c like so: bash -c 'ssh $BOUNCER "nc -k -l -U /tmp/weechat.notify.sock" &'. And this too exhibited the two-message-only behavior.

I also went ahead and tested this directly on the remote machine (SSHing to localhost, and wrapping in two bash -c invocations) and witnessed the same broken behavior. It also does not seem to be related to double forking causing orphaned processes. As it does not seem to matter if the process ends up orphaned or not.

I also verified this is also happening under zsh.

It seems that this is somehow related to the way which STDIN and STDOUT are treated when a process is run under layers of sub-processing.

Repro. Instructions & strace Output:

In order to simplify debugging I removed SSH from the picture and wrote two simplified test scripts which successfully reproduced the behavior entirely locally.

Using the Juergen Nickelsen's socket command I created a local UNIX Domain Socket (socket -l -s ./test.sock), and once again was able to send test messages to it with irb using the following chunk of Ruby code:

require 'socket'
require 'base64'

SOCKET = './test.sock'

def send(subtitle, message)
  UNIXSocket.open(SOCKET) do |socket|
    socket.puts "#{Base64.strict_encode64(subtitle)} #{Base64.strict_encode64(message)}"
  end
end

send('test', 'hi')
send('test', 'hi')
send('test', 'hi')
send('test', 'hi')
send('test', 'hi')
send('test', 'hi')

The First Script only backgrounded the piped expression (which, as previously stated, processed an unlimited number of messages):

#!/bin/bash
 
# to aid in cleanup when using Ctrl-C to exit strace
trap "pkill -f -x 'nc -k -l -U $HOME/test.sock'; exit" SIGINT
 
# Start listening for notifications
nc -k -l -U $HOME/test.sock | \
  while read type message; do
    # write messages to a local file instead of sending to notification daemon for simplicity.
    echo "$(echo -n $type | base64 -di -)" "$(echo -n $message | base64 -di -)" >> /tmp/msg
  done &
 
read

And produced the following output when run with strace -f: http://pastebin.com/SMjti3qW

The Second Script backgrounded the wrapping function (which triggers the 2-and-done behavior):

#!/bin/bash

# to aid in cleanup when using Ctrl-C to exit strace
trap "pkill -f -x 'nc -k -l -U $HOME/test.sock'; exit" SIGINT

# Start listening for notifications
function irc_notify() {
  nc -k -l -U $HOME/test.sock | \
    while read type message; do
      # write messages to a local file instead of sending to notification daemon for simplicity.
      echo "$(echo -n $type | base64 -di -)" "$(echo -n $message | base64 -di -)" >> /tmp/msg
    done
}

irc_notify &

read

And which, in turn, produced this following output when run with strace -f: http://pastebin.com/WsrXX0EJ

One thing which stands out to me when looking at the strace output from the above scripts is the output specific to the nc command. Which seems to show one of the main differences between the execution of these two scripts.

The First script's "working" nc strace output:

accept(3, {sa_family=AF_FILE, NULL}, [2]) = 4
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, -1) = 1 ([{fd=4, revents=POLLIN|POLLHUP}])
read(4, "dGVzdA== aGk=\n", 2048)        = 14
write(1, "dGVzdA== aGk=\n", 14)         = 14
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, -1) = 1 ([{fd=4, revents=POLLIN|POLLHUP}])
read(4, "", 2048)                       = 0
shutdown(4, 0 /* receive */)            = 0
close(4)                                = 0
accept(3, {sa_family=AF_FILE, NULL}, [2]) = 4
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, -1) = 1 ([{fd=4, revents=POLLIN|POLLHUP}])
read(4, "dGVzdA== aGk=\n", 2048)        = 14
write(1, "dGVzdA== aGk=\n", 14)         = 14
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, -1) = 1 ([{fd=4, revents=POLLIN|POLLHUP}])
read(4, "", 2048)                       = 0
shutdown(4, 0 /* receive */)            = 0
close(4)                                = 0
accept(3, {sa_family=AF_FILE, NULL}, [2]) = 4
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, -1) = 1 ([{fd=4, revents=POLLIN|POLLHUP}])
read(4, "dGVzdA== aGk=\n", 2048)        = 14
write(1, "dGVzdA== aGk=\n", 14)         = 14
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, -1) = 1 ([{fd=4, revents=POLLIN|POLLHUP}])
read(4, "", 2048)                       = 0
shutdown(4, 0 /* receive */)            = 0
close(4)                                = 0
accept(3, {sa_family=AF_FILE, NULL}, [2]) = 4
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, -1) = 1 ([{fd=4, revents=POLLIN|POLLHUP}])
read(4, "dGVzdA== aGk=\n", 2048)        = 14
write(1, "dGVzdA== aGk=\n", 14)         = 14
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, -1) = 1 ([{fd=4, revents=POLLIN|POLLHUP}])
read(4, "", 2048)                       = 0
shutdown(4, 0 /* receive */)            = 0
close(4)                                = 0
accept(3, {sa_family=AF_FILE, NULL}, [2]) = 4
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, -1) = 1 ([{fd=4, revents=POLLIN|POLLHUP}])
read(4, "dGVzdA== aGk=\n", 2048)        = 14
write(1, "dGVzdA== aGk=\n", 14)         = 14
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, -1) = 1 ([{fd=4, revents=POLLIN|POLLHUP}])
read(4, "", 2048)                       = 0
shutdown(4, 0 /* receive */)            = 0
close(4)                                = 0
accept(3, {sa_family=AF_FILE, NULL}, [2]) = 4
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, -1) = 1 ([{fd=4, revents=POLLIN|POLLHUP}])
read(4, "dGVzdA== aGk=\n", 2048)        = 14
write(1, "dGVzdA== aGk=\n", 14)         = 14
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, -1) = 1 ([{fd=4, revents=POLLIN|POLLHUP}])
read(4, "", 2048)                       = 0
shutdown(4, 0 /* receive */)            = 0
close(4)                                = 0
accept(3,

The Second Script's "2-and-done" Behavior seen in nc strace output:

accept(3, {sa_family=AF_FILE, NULL}, [2]) = 4
poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, -1) = 2 ([{fd=4, revents=POLLIN|POLLHUP}, {fd=0, revents=POLLHUP}])
read(4, "dGVzdA== aGk=\n", 2048)        = 14
write(1, "dGVzdA== aGk=\n", 14)         = 14
shutdown(4, 1 /* send */)               = 0
close(0)                                = 0
poll([{fd=4, events=POLLIN}, {fd=-1}], 2, -1) = 1 ([{fd=4, revents=POLLIN|POLLHUP}])
read(4, "", 2048)                       = 0
shutdown(4, 0 /* receive */)            = 0
close(4)                                = 0
accept(3, {sa_family=AF_FILE, NULL}, [2]) = 0
poll([{fd=0, events=POLLIN}, {fd=0, events=POLLIN}], 2, -1) = 2 ([{fd=0, revents=POLLIN|POLLHUP}, {fd=0, revents=POLLIN|POLLHUP}])
read(0, "dGVzdA== aGk=\n", 2048)        = 14
write(1, "dGVzdA== aGk=\n", 14)         = 14
read(0, "", 2048)                       = 0
shutdown(0, 1 /* send */)               = 0
close(0)                                = 0
poll([{fd=0, events=POLLIN}, {fd=-1}], 2, -1) = 1 ([{fd=0, revents=POLLNVAL}])
poll([{fd=0, events=POLLIN}, {fd=-1}], 2, -1) = 1 ([{fd=0, revents=POLLNVAL}])
poll([{fd=0, events=POLLIN}, {fd=-1}], 2, -1) = 1 ([{fd=0, revents=POLLNVAL}])
poll([{fd=0, events=POLLIN}, {fd=-1}], 2, -1) = 1 ([{fd=0, revents=POLLNVAL}])
poll([{fd=0, events=POLLIN}, {fd=-1}], 2, -1) = 1 ([{fd=0, revents=POLLNVAL}])
poll([{fd=0, events=POLLIN}, {fd=-1}], 2, -1) = 1 ([{fd=0, revents=POLLNVAL}])
poll([{fd=0, events=POLLIN}, {fd=-1}], 2, -1) = 1 ([{fd=0, revents=POLLNVAL}])
poll([{fd=0, events=POLLIN}, {fd=-1}], 2, -1) = 1 ([{fd=0, revents=POLLNVAL}])
poll([{fd=0, events=POLLIN}, {fd=-1}], 2, -1) = 1 ([{fd=0, revents=POLLNVAL}])
poll([{fd=0, events=POLLIN}, {fd=-1}], 2, -1) = 1 ([{fd=0, revents=POLLNVAL}])
.......[truncated].......

I'm not where I'd like to be in regards to my strace output readability, so I'm not exactly sure what these different outputs imply—aside from the fact that one is obviously working while the other is not.

As I've dug through the larger strace output, it also seems that messages after the first two are no longer terminated by a newline? But again, I'm not sure what that implies, or if I'm reading it correctly.

And I definitely do not understand how the different sub-processing techniques, or even closing STDIN, could be affecting this behavior.

Any idea what I'm encountering here?

tl;dr

I'm trying to figure out why running my notification listener under more than one-layer of sub-processing results in only two messages being processed; and not doing so results in a race-condition on STDIN.

Best Answer

Newer derivatives of the OpenBSD netcat, including FreeBSD[1] and Debian[2], support a -d flag which prevents reading from stdin and fixes the problem you described.

The problem is that netcat is polling stdin as well as its "network" fd, and stdin is reopened from /dev/null in the second case above, where the shell function is run in the background before the pipeline is created. That means an immediate EOF on the first read from stdin (fd 0), but netcat will continue to poll(2) on the now-closed stdin, creating an endless loop.

Here is the redirection of stdin before the pipeline creation:

249 [pid 23186] open("/dev/null", O_RDONLY <unfinished ...>
251 [pid 23186] <... open resumed> )        = 3
253 [pid 23186] dup2(3, 0)                  = 0
254 [pid 23186] close(3)                    = 0

Now when netcat (pid 23187) calls its first poll(2), it reads EOF from stdin and closes fd 0:

444 [pid 23187] poll([{fd=4, events=POLLIN}, {fd=0, events=POLLIN}], 2, 4294967295) = 2 ([{fd=4, revents=POLLIN|POLLHUP}, {fd=0, revents=POLLIN}])
448 [pid 23187] read(0,  <unfinished ...>
450 [pid 23187] <... read resumed> "", 2048) = 0
456 [pid 23187] close(0 <unfinished ...>
458 [pid 23187] <... close resumed> )       = 0

The next call to accept(2) yields a client on fd 0, which is now the lowest-numbered free fd:

476 [pid 23187] accept(3,  <unfinished ...>
929 [pid 23187] <... accept resumed> {sa_family=AF_LOCAL, NULL}, [2]) = 0

Note here that netcat is now including fd 0 in the args to poll(2) twice: once for STDIN_FILENO, which is always included in the absence of the -d command-line parameter, and once for the newly-connected client:

930 [pid 23187] poll([{fd=0, events=POLLIN}, {fd=0, events=POLLIN}], 2, 4294967295) = 2 ([{fd=0, revents=POLLIN|POLLHUP}, {fd=0, revents=POLLIN|POLLHUP}])

The client sends EOF and netcat disconnects:

936 [pid 23187] read(0,  <unfinished ...>
938 [pid 23187] <... read resumed> "", 2048) = 0
940 [pid 23187] shutdown(0, SHUT_WR <unfinished ...>
942 [pid 23187] <... shutdown resumed> )    = 0
944 [pid 23187] close(0 <unfinished ...>
947 [pid 23187] <... close resumed> )       = 0

But now it's in trouble because it will continue to poll on fd 0, which is now closed. The netcat code does not handle the case of POLLNVAL being set in the .revents member of struct pollfd, so it gets into an endless loop, never to call accept(2) again:

949 [pid 23187] poll([{fd=0, events=POLLIN}, {fd=-1}], 2, 4294967295 <unfinished ...>
951 [pid 23187] <... poll resumed> )        = 1 ([{fd=0, revents=POLLNVAL}])
953 [pid 23187] poll([{fd=0, events=POLLIN}, {fd=-1}], 2, 4294967295 <unfinished ...>
955 [pid 23187] <... poll resumed> )        = 1 ([{fd=0, revents=POLLNVAL}])
...

In the first command, where the pipeline is backgrounded but is not run in a shell function, stdin is left open, so this case doesn't arise.

Code references (see the readwrite function):

  1. http://svnweb.freebsd.org/base/head/contrib/netcat/
  2. https://sources.debian.net/src/netcat-openbsd/1.105-7/
Related Question