I'm tired of the slow startup time of hadoop fs just to query HDFS. This isn't a problem with HDFS itself, though, because using HDFS file system commands within the Pig "grunt shell" is pretty fast. But it's impractical to always start up the grunt shell when I just want to issue some HDFS commands. So I wrote this script to start a grunt shell instance in the background for me and keep it open for subsequent calls:
#!/bin/bash
in=/tmp/grunt_in
out=/tmp/grunt_out
err=/tmp/grunt_err

# Start a single background grunt shell on first use, wired to two FIFOs.
if [ ! -p "$in" ]
then
    mkfifo "$in"
    mkfifo "$out"
    # Open the input FIFO read-write (<>) so pig doesn't quit after the
    # first command; clean up the FIFOs when pig exits.
    ( pig <>"$in" >"$out" 2>"$err"; rm "$in" "$out" ) &
    disown
fi

>"$err"                  # Truncate errors from the previous call
echo "fs $*" >"$in"      # Forward the HDFS command to grunt
echo >"$in"
echo "-- end" >"$in"     # Sentinel marking the end of this command's output
# Print everything up to the sentinel, dropping the grunt> prompts.
sed -n '/^grunt> -- end/q;/^grunt>/d;p' "$out"
cat "$err" >&2           # Replay pig's stderr on our stderr
Of course not only does the input have to be sent to the script, but the output from the script also has to be redirected back to my current bash session. I use the /tmp/grunt_in and /tmp/grunt_out FIFOs here to accomplish that. To figure out when pig has processed the command, I send a "-- end" comment and detect it in the sed command, which listens on the output, quits when it encounters the end token, and only outputs the relevant part by skipping all grunt> prompts.
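
The sentinel filtering can be checked in isolation. A minimal demonstration with fabricated grunt output (the file names are made up):

$ printf 'grunt> fs -ls\nFound 2 items\nfile1\nfile2\ngrunt> -- end\n' | sed -n '/^grunt> -- end/q;/^grunt>/d;p'
Found 2 items
file1
file2

With -n, auto-printing is off, so q exits at the sentinel without printing it, the prompt lines are deleted by d, and everything else is printed explicitly by p.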
Note that I have to attach the input FIFO with <>$in, even though I redirect the output to $out, to prevent pig from quitting after the first command. I don't know exactly why, but I figured out that it works this way.
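
A plausible explanation is ordinary FIFO semantics: a reader sees EOF as soon as the last writer closes the pipe, whereas opening the FIFO read-write with <> keeps one writer (the fd itself) open permanently. A small sketch illustrating this (the path is made up):

mkfifo /tmp/fifo_demo
cat </tmp/fifo_demo &       # reader opened read-only
echo one >/tmp/fifo_demo    # cat prints "one", sees EOF, and exits

cat 0<>/tmp/fifo_demo &     # read-write: the fd itself counts as a writer
echo one >/tmp/fifo_demo    # cat prints "one" and keeps waiting
echo two >/tmp/fifo_demo    # cat prints "two", still no EOF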
This actually works quite nicely already, e.g.:
$ time hadoop fs -ls
Found 38 items
[ skipped output ]
real 0m1.828s
user 0m3.160s
sys 0m0.137s
$ time dfs -ls
Found 38 items
[ skipped output ]
real 0m0.149s
user 0m0.003s
sys 0m0.006s
(I called my script dfs here.) There are only two problems left which I can't figure out:
- When I call the script for the first time (that is, when the FIFO /tmp/grunt_in doesn't exist yet and the pig instance is started in the background), my terminal settings somehow get messed up. I don't get an echo of my input anymore, so I have to type a reset blindly to get a sane terminal back. Subsequent calls work fine, though.
- When I try to output file contents on HDFS with -cat or -text, the output gets arbitrarily truncated, e.g.:

  $ hadoop fs -text some-medium-size.gz | wc -l
  3606
  $ dfs -text some-medium-size.gz | wc -l
  text: Unable to write to output stream.
  9

  Note the error message text: Unable to write to output stream. here, which is not coming from pig but from the fs -text command of hadoop. Sometimes the output is truncated at the first 9 or 10 lines as here, sometimes somewhere in the middle. It's quite strange. I also tried to send the command manually to /tmp/grunt_in and read /tmp/grunt_out with cat (see the sketch after this list), with the same result, but this confirms that my parsing with sed can't be the issue here. It also doesn't seem to be a problem with big outputs in general; long directory listings work just fine:

  $ dfs -ls -R | wc -l
  10686
  (Which gives the same result as hadoop fs -ls -R | wc -l.)
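
For reference, the manual test mentioned in the second item looks roughly like this (a sketch; only the idea, not the exact commands, was described above):

# terminal 1: read the raw grunt output
cat /tmp/grunt_out

# terminal 2: send the command by hand, mimicking what the script does
echo "fs -text some-medium-size.gz" > /tmp/grunt_in
echo > /tmp/grunt_in
echo "-- end" > /tmp/grunt_in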
Maybe the last problem is a problem with hadoop fs -text and hadoop fs -cat themselves? Or am I doing something wrong with my use of named pipes?
Best Answer
I have now more or less settled on this version:
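(The script itself isn't reproduced above; based on the changes described below, it plausibly looks something like this. The exact script -q -c invocation and the error-FIFO cleanup are my reconstruction, not the original code.)

#!/bin/bash
in=/tmp/grunt_in
out=/tmp/grunt_out
err=/tmp/grunt_err
if [ ! -p "$in" ]
then
    mkfifo "$in" "$out" "$err"
    # script(1) gives pig a pty of its own so it can no longer mess up the
    # calling terminal; stderr is redirected inside the script command.
    { script -q -c "pig <>$in >$out 2>$err" /dev/null; rm -f "$in" "$out" "$err"; } &
fi
# $err is now a FIFO, so errors can be streamed early; the cat has to be
# killed afterwards, which is the complication mentioned below.
cat "$err" >&2 &
errcat=$!
echo "fs $*" >"$in"
echo >"$in"
echo "-- end" >"$in"
sed -n '/^grunt> -- end/q;/^grunt>/d;p' "$out"
kill "$errcat" 2>/dev/null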
So I simply redirect stderr inside of the script command. I also replaced the round braces with curly braces and removed the disown because I didn't see any advantage in keeping it. I also replaced $err with a FIFO in order to be able to output it early, but that adds some complications for killing the cat.

This works fairly well so far, except that when I truncate the output by piping through head, I get truncated or extra output in the next command. Apparently I need a way to properly flush the named pipes. I would be glad if someone had any hints.
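
One direction that might help (an untested idea, not part of the script above): when head closes the pipe early, sed dies from SIGPIPE before it has consumed everything up to the -- end marker, so the leftover lines stay in the output FIFO and corrupt the next call. Draining up to the marker whenever sed exits abnormally could keep the FIFO clean:

sed -n '/^grunt> -- end/q;/^grunt>/d;p' "$out"
if [ $? -ne 0 ]
then
    # sed was killed by SIGPIPE (exit status 141); discard the rest of this
    # command's output up to the sentinel so the next call starts clean
    sed -n '/^grunt> -- end/q' "$out" >/dev/null
fi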