Uniq and bash for loop not writing to stdout before stdin closes (for a one-line website visitor notification system)

for, pipe, stdin, stdout, uniq

I'm trying to trigger a beep on the PC speaker for every unique visitor to a website.

After some brainstorming, it seemed to be possible with one line:

for e in `ssh me@mymachine "tail -n 1 -f /var/log/apache2/test.log | awk '{print $1}' | uniq"`; do beep; done

However, uniq doesn't output anything as long as stdin is open (it seems to wait for EOF). The same goes for the for loop. Even if I remove uniq from the chain, I get no output while tail keeps the pipe open.

This doesn't seem to be just a buffering problem: even if I write more than 100,000 lines into the test file while this command is running, there's no output on the other end.

Is there a way to get that working without completely killing the beauty (simplicity) of the solution?

Update

I solved the first part. uniq is unblocked by prefixing the tail command with stdbuf -oL -eL (see https://unix.stackexchange.com/a/25378/109296).
The same trick doesn't work for the for loop.
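In hindsight, the loop part makes sense: the backtick command substitution must run to completion before the for loop body starts at all, so it can never beep early. A pipeline into while read consumes lines as they arrive instead; a minimal sketch, assuming stdbuf is available on the remote machine (as Update 2 below shows, remote buffering can still interfere, so every remote pipeline stage may need the prefix):

ssh me@mymachine "stdbuf -oL tail -n 1 -f /var/log/apache2/test.log" |
while read -r line; do
  beep
done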

Update 2

I got it working, though not exactly according to my spec, and with 2 lines:

while true; do ssh root@speedy "stdbuf -oL -eL tail -n 1 -f /var/log/apache2/www.access.log | stdbuf -oL -eL grep 'GET / '"; sleep 60; done > www.log

awk '{print $1}' is missing because it didn't work inside this construct (it just passed the whole line through); I don't know why. But I can live without it, because uniq turned out not to be so useful after all: it only compares adjacent lines, so the request pattern ip1, ip2, ip1 would still let ip1 through twice.
uniq -u would do what I expect, but it has the same problem as sort: it doesn't output anything as long as stdin is open (not even with stdbuf -oL).
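For the record, awk can deduplicate a stream without sorting, which is exactly the part uniq can't do. A sketch (depending on your awk implementation, the stdbuf prefix may still be needed on top):

tail -f www.log | awk '!seen[$1]++ { print $1 }'

Each IP passes through only the first time it appears, no matter how far apart the repeats are.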

This command just writes all requests for the base URL (/) to another file.
I wrapped it in a loop (with a sleep) so that it automatically retries if the pipe or connection breaks for some reason.

while inotifywait -e modify www.log; do beep -f 250; done
makes the sound! I could not get the bash for loop to process lines unbuffered; I also tried while read, with the same result. So I gave up and went with inotifywait, which however means I need an intermediate file (maybe a named pipe would also work; I didn't try, and it doesn't really make a difference for me).
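For completeness, the untested named-pipe variant could look like this; mkfifo replaces the intermediate file, and the reader then beeps once per line instead of once per file modification:

mkfifo www.pipe
while true; do ssh root@speedy "stdbuf -oL -eL tail -n 1 -f /var/log/apache2/www.access.log | stdbuf -oL -eL grep 'GET / '"; sleep 60; done > www.pipe &
while read -r line; do beep -f 250; done < www.pipe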

I'd still be thankful for contributions that help make the filtering of unique visitors work (without escalating complexity).

This will be a nice surprise for my team members when they return to the office 🙂

I plan to extend this notification system to monitor several events, using different audio frequencies. That's the best job I've found so far for an old server collecting dust…

Best Answer

I think I understand what you are trying to accomplish:

  1. For each hit to the web site, as logged by the web server,
  2. if the visit is "unique" (how do you define this?), log the entry and send an audible notification.

The trick is how you define "unique". Is it by URL, by IP address, by cookie? Your approach with awk was arguably the right way to go, but you got snagged by shell-escaping rules.
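To make the snag concrete: inside the double-quoted argument to ssh, the local shell expands $1 (usually to an empty string) before ssh ever runs, so the remote awk receives {print } and passes whole lines through. Escaping the dollar sign keeps it intact for the remote side:

# unescaped: the local shell eats $1, the remote awk prints whole lines
ssh me@mymachine "tail -f /var/log/apache2/test.log | awk '{print $1}'"
# escaped: \$1 reaches the remote awk as intended
ssh me@mymachine "tail -f /var/log/apache2/test.log | awk '{print \$1}'"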

So here is something that sort of combines your approaches. First, you really need a script on the web server to do this; otherwise you'll get lost in complex quotation-escaping rules. Second, I'm assuming your web server uses the common log format, which frankly sucks for this kind of work, but we can work with it.

while true; do 
  ssh root@speedy remote-log-capturing-script
done > unique-visits.log
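To tie this back to the beep, the inotifywait trick from the question works unchanged against the new file (assuming the same tools as in the question):

# -qq silences inotifywait's own output; beep fires on each write
while inotifywait -qq -e modify unique-visits.log; do beep -f 250; done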

Use mikeserv's excellent suggestion about MAILFILE. The script on speedy should look like this:

#!/bin/sh
tail -1f /var/log/apache2/www.access.log |
awk '$(NF-1) == 200' |                    # keep only status-200 hits
grep --line-buffered -o '"GET [^"]*"' |   # extract the quoted request
awk '!url[$0]{ print; url[$0]=1 }'        # print each request only once

Awk is usually line-buffered when writing to a pipe (if yours is not, the same stdbuf trick applies). The first awk ensures you're only getting actual successful hits (status 200), not cached hits or 404s. The grep -o prints out only the matching part of the input, in this case the quoted request. (This is GNU grep, which I assume you are using. If not, use the stdbuf trick.) The next awk uses a little expression to conditionally print out an input line -- only if that exact line (keyed on $0) was never seen before.
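A quick sanity check of the pipeline on a single made-up common-log line:

echo '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326' |
awk '$(NF-1) == 200' |
grep -o '"GET [^"]*"'
# prints: "GET /index.html HTTP/1.0"

Note the matched part includes the method and protocol, not just the path.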

You can also do this with perl to achieve more complexity within one fork:

#!/bin/sh
tail -1f /var/log/apache2/www.access.log | 
perl -lane '$|=1;' \
  -e 'if ($F[$#F-1] eq "200" and ' \
  -e ' /\s"GET\s([^"]*)"\s/ and !$url{$1}) { '\
  -e '  print $1; $url{$1}=1; }'

Now both of these will only print unique URLs. What if two web clients from different IPs hit the same page? You only get one output. Changing that is easy with the perl solution: modify the key that goes into %url.

 $url{$F[0],$1}

When using perl -a, $F[0] is the first whitespace-delimited field of the input, just like awk's $1 -- i.e., the connecting hostname/IP address. And perl's $1 holds the first capturing group of the regular expression /\s"GET\s([^"]*)"\s/, i.e., the URL itself. The cryptic $F[$#F-1] means the second-to-last field of the input line.
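Putting that together, a per-IP variant of the whole script might look like this (same assumptions as above; each unique IP/URL pair now prints once):

#!/bin/sh
tail -1f /var/log/apache2/www.access.log |
perl -lane '$|=1;' \
  -e 'if ($F[$#F-1] eq "200" and ' \
  -e ' /\s"GET\s([^"]*)"\s/ and !$url{$F[0],$1}) { '\
  -e '  print "$F[0] $1"; $url{$F[0],$1}=1; }'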
