Uniq and bash for loop not writing to stdout before stdin closes (for a one-line website visitor notification system)

for, pipe, stdin, stdout, uniq

I'm trying to trigger a beep on the PC speaker for every unique visitor to a website.

After some brainstorming, it seemed to be possible with one line:

for e in `ssh me@mymachine "tail -n 1 -f /var/log/apache2/test.log | awk '{print $1}' | uniq"`; do beep; done

However, uniq doesn't output anything as long as stdin is open (it seems to wait for EOF). The same goes for the for loop. Even if I remove uniq from the chain, I get no output while tail keeps the pipe open.

This doesn't seem to be just a buffering problem: even if I write more than 100,000 lines into the test file while this command is running, there's no output on the other end.

Is there a way to get that working without completely killing the beauty (simplicity) of the solution?

Update

I solved the first part. uniq is unblocked by prefixing the tail command with stdbuf -oL -eL (see https://unix.stackexchange.com/a/25378/109296).
The same trick doesn't work for the for loop.
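In hindsight, the loop part makes sense: the backtick command substitution must run to completion before the for loop body starts at all, so it can never beep early. A pipeline into while read consumes lines as they arrive instead; a minimal sketch, assuming stdbuf is available on the remote machine (as Update 2 below shows, remote buffering can still interfere, so every remote pipeline stage may need the prefix):

ssh me@mymachine "stdbuf -oL tail -n 1 -f /var/log/apache2/test.log" |
while read -r line; do
  beep
done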

Update 2

I got it working, though not exactly according to my spec, and with 2 lines:

while true; do ssh root@speedy "stdbuf -oL -eL tail -n 1 -f /var/log/apache2/www.access.log | stdbuf -oL -eL grep 'GET / '"; sleep 60; done > www.log

awk '{print $1}' is missing because it didn't work inside this construct (it just passed the whole line through); I don't know why. But I can live without it, because uniq turned out not to be so useful after all: it only compares adjacent lines, so the request pattern ip1, ip2, ip1 would still let ip1 through twice.
uniq -u would do what I expect, but it has the same problem as sort: it doesn't output anything as long as stdin is open (not even with stdbuf -oL).
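For the record, awk can deduplicate a stream without sorting, which is exactly the part uniq can't do. A sketch (depending on your awk implementation, the stdbuf prefix may still be needed on top):

tail -f www.log | awk '!seen[$1]++ { print $1 }'

Each IP passes through only the first time it appears, no matter how far apart the repeats are.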

This command just writes all requests for the base URL (/) to another file.
I wrapped it in a loop (with a sleep) so that it automatically retries if the pipe or connection breaks for some reason.

while inotifywait -e modify www.log; do beep -f 250; done
makes the sound! I could not get the bash for loop to process lines unbuffered; I also tried while read, with the same result. So I gave up and went with inotifywait, which however means I need an intermediate file (maybe a named pipe would also work; I didn't try, and it doesn't really make a difference for me).
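For completeness, the untested named-pipe variant could look like this; mkfifo replaces the intermediate file, and the reader then beeps once per line instead of once per file modification:

mkfifo www.pipe
while true; do ssh root@speedy "stdbuf -oL -eL tail -n 1 -f /var/log/apache2/www.access.log | stdbuf -oL -eL grep 'GET / '"; sleep 60; done > www.pipe &
while read -r line; do beep -f 250; done < www.pipe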

I'd still be thankful for contributions that help make the filtering of unique visitors work (without escalating complexity).

This will be a nice surprise for my team members when they return to the office 🙂

I plan to extend this notification system to monitor several events, using different audio frequencies. That's the best job I've found so far for an old server collecting dust…

Best Answer

I think I understand what you are trying to accomplish:

  1. For each hit to the web site, as logged by the web server,
  2. if the visit is "unique" (how do you define this?), log the entry and send an audible notification.

The trick is how you define "unique". Is it by URL, by IP address, by cookie? Your approach with awk was arguably the right way to go, but you got snagged by shell-escaping rules.
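To make the snag concrete: inside the double-quoted argument to ssh, the local shell expands $1 (usually to an empty string) before ssh ever runs, so the remote awk receives {print } and passes whole lines through. Escaping the dollar sign keeps it intact for the remote side:

# unescaped: the local shell eats $1, the remote awk prints whole lines
ssh me@mymachine "tail -f /var/log/apache2/test.log | awk '{print $1}'"
# escaped: \$1 reaches the remote awk as intended
ssh me@mymachine "tail -f /var/log/apache2/test.log | awk '{print \$1}'"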

So here is something that sort of combines your approaches. First, you really need a script on the web server to do this; otherwise you'll get lost in complex quotation-escaping rules. Second, I'm assuming your web server uses the common log format, which frankly sucks for this kind of work, but we can work with it.

while true; do 
  ssh root@speedy remote-log-capturing-script
done > unique-visits.log
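To tie this back to the beep, the inotifywait trick from the question works unchanged against the new file (assuming the same tools as in the question):

# -qq silences inotifywait's own output; beep fires on each write
while inotifywait -qq -e modify unique-visits.log; do beep -f 250; done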

Use mikeserv's excellent suggestion about MAILFILE. The script on speedy should look like this:

#!/bin/sh
tail -1f /var/log/apache2/www.access.log |
awk '$(NF-1) == 200' |                    # keep only status-200 hits
grep --line-buffered -o '"GET [^"]*"' |   # extract the quoted request
awk '!url[$0]{ print; url[$0]=1 }'        # print each request only once

Awk is usually line-buffered when writing to a pipe (if yours is not, the same stdbuf trick applies). The first awk ensures you're only getting actual successful hits (status 200), not cached hits or 404s. The grep -o prints out only the matching part of the input, in this case the quoted request. (This is GNU grep, which I assume you are using. If not, use the stdbuf trick.) The next awk uses a little expression to conditionally print out an input line -- only if that exact line (keyed on $0) was never seen before.
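A quick sanity check of the pipeline on a single made-up common-log line:

echo '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326' |
awk '$(NF-1) == 200' |
grep -o '"GET [^"]*"'
# prints: "GET /index.html HTTP/1.0"

Note the matched part includes the method and protocol, not just the path.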

You can also do this with perl to achieve more complexity within one fork:

#!/bin/sh
tail -1f /var/log/apache2/www.access.log | 
perl -lane '$|=1;' \
  -e 'if ($F[$#F-1] eq "200" and ' \
  -e ' /\s"GET\s([^"]*)"\s/ and !$url{$1}) { '\
  -e '  print $1; $url{$1}=1; }'

Now both of these will only print unique URLs. What if two web clients from different IPs hit the same page? You only get one output. Changing that is easy with the perl solution: modify the key that goes into %url.

 $url{$F[0],$1}

When using perl -a, $F[0] is the first whitespace-delimited field of the input, just like awk's $1 -- i.e., the connecting hostname/IP address. And perl's $1 holds the first capturing group of the regular expression /\s"GET\s([^"]*)"\s/, i.e., the URL itself. The cryptic $F[$#F-1] means the second-to-last field of the input line.
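Putting that together, a per-IP variant of the whole script might look like this (same assumptions as above; each unique IP/URL pair now prints once):

#!/bin/sh
tail -1f /var/log/apache2/www.access.log |
perl -lane '$|=1;' \
  -e 'if ($F[$#F-1] eq "200" and ' \
  -e ' /\s"GET\s([^"]*)"\s/ and !$url{$F[0],$1}) { '\
  -e '  print "$F[0] $1"; $url{$F[0],$1}=1; }'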
