Shell Script Piping – Handling Commands with Very Large Output

pipeshell-script

I want to tar a directory and write the result to stdout, then pipe it to a compression program, like this:

tar -cvf - /tmp/source-dir | lzip -o /media/my-usb/result.lz -

I have been using pipe all the time with commands which output several lines of text. Now I am wondering what would happen when I pipes a (fast) command with very large output such as tar and a very slow compression command followed? Will tar wait for its output to be consumed by lzip? Or it just does as fast as it can then outputs everything to RAM? It will be a disaster in low RAM system if the latter is true.

Best Answer

When the data producer (tar) tries to write to the pipe too quickly for the consumer (lzip) to have time to read all of it, it will block until lzip has had time to read what tar is writing. There is a small buffer associated with the pipe, but its size is likely to be smaller than the size of most tar archives. There is no risk of filling up your system's RAM with your pipeline.

"Blocking" simply means that when tar does a call to the write() library function (or equivalent), the call won't return until the data has been delivered to the pipe buffer, which could take a bit of time if lzip is slow to read from that same buffer. You should be able to see this in top where tar would slow down and sleep a lot compared to lzip (assuming tar is in fact quicker than lzip).

You would therefore not fill up a significant amount of RAM with your pipeline. To do that (if you wanted to), you could use something like pv in the middle, with some large buffer (here, a gigabyte):

tar -cvf - /tmp/source-dir | pv --buffer-size 1G | lzip -o /media/my-usb/result.lz -

This would still block tar whenever pv blocks. pv would block when its buffer is full and it can't write to lzip.

The reverse situation works in a similar way, i.e. if you have a slow left-hand side of a pipe writing to a fast right-hand side, the consumer on the right would block on read() until there is data to be read from the pipe.

This (data I/O) is the only thing that synchronises the processes taking part in a pipeline. Apart from reading and writing (and occasionally blocking while waiting for someone else to read or write), they would run independently of each other.

Related Solutions

Command Piping – Good Examples of Piping Commands Together

I'm going to walk you through a somewhat complex example, based on a real life scenario.

Problem

Let's say the command conky stopped responding on my desktop, and I want to kill it manually. I know a little bit of Unix, so I know that what I need to do is execute the command kill <PID>. In order to retrieve the PID, I can use ps or top or whatever tool my Unix distribution has given me. But how can I do this in one command?

Answer

$ ps aux | grep conky | grep -v grep | awk '{print $2}' | xargs kill

DISCLAIMER: This command only works in certain cases. Don't copy/paste it in your terminal and start using it, it could kill processes unsuspectingly. Rather learn how to build it.

How it works

1- ps aux

This command will output the list of running processes and some info about them. The interesting info is that it'll output the PID of each process in its 2nd column. Here's an extract from the output of the command on my box:

$ ps aux
 rahmu     1925  0.0  0.1 129328  6112 ?        S    11:55   0:06 tint2
 rahmu     1931  0.0  0.3 154992 12108 ?        S    11:55   0:00 volumeicon
 rahmu     1933  0.1  0.2 134716  9460 ?        S    11:55   0:24 parcellite
 rahmu     1940  0.0  0.0  30416  3008 ?        S    11:55   0:10 xcompmgr -cC -t-5 -l-5 -r4.2 -o.55 -D6
 rahmu     1941  0.0  0.2 160336  8928 ?        Ss   11:55   0:00 xfce4-power-manager
 rahmu     1943  0.0  0.0  32792  1964 ?        S    11:55   0:00 /usr/lib/xfconf/xfconfd
 rahmu     1945  0.0  0.0  17584  1292 ?        S    11:55   0:00 /usr/lib/gamin/gam_server
 rahmu     1946  0.0  0.5 203016 19552 ?        S    11:55   0:00 python /usr/bin/system-config-printer-applet
 rahmu     1947  0.0  0.3 171840 12872 ?        S    11:55   0:00 nm-applet --sm-disable
 rahmu     1948  0.2  0.0 276000  3564 ?        Sl   11:55   0:38 conky -q

2- grep conky

I'm only interested in one process, so I use grep to find the entry corresponding to my program conky.

$ ps aux | grep conky
 rahmu     1948  0.2  0.0 276000  3564 ?        Sl   11:55   0:39 conky -q
 rahmu     3233  0.0  0.0   7592   840 pts/1    S+   16:55   0:00 grep conky

3- grep -v grep

As you can see in step 2, the command ps outputs the grep conky process in its list (it's a running process after all). In order to filter it, I can run grep -v grep. The option -v tells grep to match all the lines excluding the ones containing the pattern.

$ ps aux | grep conky | grep -v grep
 rahmu     1948  0.2  0.0 276000  3564 ?        Sl   11:55   0:39 conky -q

NB: I would love to know a way to do steps 2 and 3 in a single grep call.

4- awk '{print $2}'

Now that I have isolated my target process. I want to retrieve its PID. In other words I want to retrieve the 2nd word of the output. Lucky for me, most (all?) modern unices will provide some version of awk, a scripting language that does wonders with tabular data. Our task becomes as easy as print $2.

$ ps aux | grep conky | grep -v grep | awk '{print $2}'
 1948

5- xargs kill

I have the PID. All I need is to pass it to kill. To do this, I will use xargs.

xargs kill will read from the input (in our case from the pipe), form a command consisting of kill <items> (<items> are whatever it read from the input), and then execute the command created. In our case it will execute kill 1948. Mission accomplished.

Final words

Note that depending on what version of unix you're using, certain programs may behave a little differently (for example, ps might output the PID in column $3). If something seems wrong or different, read your vendor's documentation (or better, the man pages). Also be careful as long pipes can be dangerous. Don't make any assumptions especially when using commands like kill or rm. For example, if there was another user named 'conky' (or 'Aconkyous') my command may kill all his running processes too!

What I'm saying is be careful, especially for long pipes. It's always better to build it interactively as we did here, than make assumptions and feel sorry later.

Shell – Piping Commands After a Piped xargs

You are almost there. In your last command, you can use -I to do the ls correctly

-I replace-str
Replace occurrences of replace-str in the initial-arguments with names read from standard input. Also, unquoted blanks do not terminate input items; instead the separator is the newline character. Implies -x and -L 1.

So, with

find . -type d -name "*log*" | xargs -I {} sh -c "echo {}; ls -la {} | tail -2"

you will echo the dir found by find, then do the ls | tail on it.

Best Answer

Related Solutions

Command Piping – Good Examples of Piping Commands Together

Shell – Piping Commands After a Piped xargs

Related Question