I'm tired of the slow startup time of hadoop fs just to query HDFS. This isn't a problem with HDFS itself, though, because using HDFS file system commands within the Pig "grunt shell" is pretty fast. But it's impractical to always start up the grunt shell when I just want to issue some HDFS commands. So I wrote this script to start a grunt shell instance in the background for me and keep it open for subsequent calls:
#!/bin/bash
in=/tmp/grunt_in
out=/tmp/grunt_out
err=/tmp/grunt_err

# Start a single background grunt shell on first use, wired to two FIFOs.
if [ ! -p "$in" ]
then
    mkfifo "$in"
    mkfifo "$out"
    # Open the input FIFO read-write (<>) so pig doesn't quit after the
    # first command; clean up the FIFOs when pig exits.
    ( pig <>"$in" >"$out" 2>"$err"; rm "$in" "$out" ) &
    disown
fi

>"$err"                  # Truncate errors from the previous call
echo "fs $*" >"$in"      # Forward the HDFS command to grunt
echo >"$in"
echo "-- end" >"$in"     # Sentinel marking the end of this command's output
# Print everything up to the sentinel, dropping the grunt> prompts.
sed -n '/^grunt> -- end/q;/^grunt>/d;p' "$out"
cat "$err" >&2           # Replay pig's stderr on our stderr
Of course not only does the input have to be sent to the script, but the output from the script also has to be redirected back to my current bash session. I use the /tmp/grunt_in and /tmp/grunt_out FIFOs here to accomplish that. To figure out when pig has processed the command, I send a "-- end" comment and detect it in the sed command, which listens on the output, quits when it encounters the end token, and only outputs the relevant part by skipping all grunt> prompts.
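
The sentinel filtering can be checked in isolation. A minimal demonstration with fabricated grunt output (the file names are made up):

$ printf 'grunt> fs -ls\nFound 2 items\nfile1\nfile2\ngrunt> -- end\n' | sed -n '/^grunt> -- end/q;/^grunt>/d;p'
Found 2 items
file1
file2

With -n, auto-printing is off, so q exits at the sentinel without printing it, the prompt lines are deleted by d, and everything else is printed explicitly by p.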
Note that I have to attach the input FIFO with <>$in, even though I redirect the output to $out, to prevent pig from quitting after the first command. I don't know exactly why, but I figured out that it works this way.
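
A plausible explanation is ordinary FIFO semantics: a reader sees EOF as soon as the last writer closes the pipe, whereas opening the FIFO read-write with <> keeps one writer (the fd itself) open permanently. A small sketch illustrating this (the path is made up):

mkfifo /tmp/fifo_demo
cat </tmp/fifo_demo &       # reader opened read-only
echo one >/tmp/fifo_demo    # cat prints "one", sees EOF, and exits

cat 0<>/tmp/fifo_demo &     # read-write: the fd itself counts as a writer
echo one >/tmp/fifo_demo    # cat prints "one" and keeps waiting
echo two >/tmp/fifo_demo    # cat prints "two", still no EOF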
This actually works quite nicely already, e.g.:
$ time hadoop fs -ls
Found 38 items
[ skipped output ]
real 0m1.828s
user 0m3.160s
sys 0m0.137s
$ time dfs -ls
Found 38 items
[ skipped output ]
real 0m0.149s
user 0m0.003s
sys 0m0.006s
(I called my script dfs here.) There are only two problems left which I can't figure out:
- When I call the script for the first time (that is, when the FIFO /tmp/grunt_in doesn't exist yet and the pig instance is started in the background), my terminal settings somehow get messed up. I don't get an echo of my input anymore, so I have to type a reset blindly to get a sane terminal back. Subsequent calls work fine, though.
- When I try to output file contents on HDFS with -cat or -text, the output gets arbitrarily truncated, e.g.:

  $ hadoop fs -text some-medium-size.gz | wc -l
  3606
  $ dfs -text some-medium-size.gz | wc -l
  text: Unable to write to output stream.
  9

  Note the error message text: Unable to write to output stream. here, which is not coming from pig but from the fs -text command of hadoop. Sometimes the output is truncated at the first 9 or 10 lines as here, sometimes somewhere in the middle. It's quite strange. I also tried to send the command manually to /tmp/grunt_in and read /tmp/grunt_out with cat (see the sketch after this list), with the same result, but this confirms that my parsing with sed can't be the issue here. It also doesn't seem to be a problem with big outputs in general; long directory listings work just fine:

  $ dfs -ls -R | wc -l
  10686
  (Which gives the same result as hadoop fs -ls -R | wc -l.)
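
For reference, the manual test mentioned in the second item looks roughly like this (a sketch; only the idea, not the exact commands, was described above):

# terminal 1: read the raw grunt output
cat /tmp/grunt_out

# terminal 2: send the command by hand, mimicking what the script does
echo "fs -text some-medium-size.gz" > /tmp/grunt_in
echo > /tmp/grunt_in
echo "-- end" > /tmp/grunt_in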
Maybe the last problem is a problem with hadoop fs -text and hadoop fs -cat themselves? Or am I doing something wrong with my use of named pipes?
Best Answer
I have now more or less settled on this version:
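(The script itself isn't reproduced above; based on the changes described below, it plausibly looks something like this. The exact script -q -c invocation and the error-FIFO cleanup are my reconstruction, not the original code.)

#!/bin/bash
in=/tmp/grunt_in
out=/tmp/grunt_out
err=/tmp/grunt_err
if [ ! -p "$in" ]
then
    mkfifo "$in" "$out" "$err"
    # script(1) gives pig a pty of its own so it can no longer mess up the
    # calling terminal; stderr is redirected inside the script command.
    { script -q -c "pig <>$in >$out 2>$err" /dev/null; rm -f "$in" "$out" "$err"; } &
fi
# $err is now a FIFO, so errors can be streamed early; the cat has to be
# killed afterwards, which is the complication mentioned below.
cat "$err" >&2 &
errcat=$!
echo "fs $*" >"$in"
echo >"$in"
echo "-- end" >"$in"
sed -n '/^grunt> -- end/q;/^grunt>/d;p' "$out"
kill "$errcat" 2>/dev/null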
So I simply redirect stderr inside of the script command. I also replaced the round braces with curly braces and removed the disown because I didn't see any advantage in keeping it. I also replaced $err with a FIFO in order to be able to output it early, but that adds some complications for killing the cat.

This works fairly well so far, except that when I truncate the output by piping through head, I get truncated or extra output in the next command. Apparently I need a way to properly flush the named pipes. I would be glad if someone had any hints.
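
One direction that might help (an untested idea, not part of the script above): when head closes the pipe early, sed dies from SIGPIPE before it has consumed everything up to the -- end marker, so the leftover lines stay in the output FIFO and corrupt the next call. Draining up to the marker whenever sed exits abnormally could keep the FIFO clean:

sed -n '/^grunt> -- end/q;/^grunt>/d;p' "$out"
if [ $? -ne 0 ]
then
    # sed was killed by SIGPIPE (exit status 141); discard the rest of this
    # command's output up to the sentinel so the next call starts clean
    sed -n '/^grunt> -- end/q' "$out" >/dev/null
fi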