Bash – way to make a section of a pipeline completely “pass through”

bash, pipe

I have a script in which data is processed by streaming it through a fairly large pipeline. Several sections of the pipeline are actually "switchboard" functions that do different things based on some external parameter. A contrived example is given below.

#! /bin/bash

switchboard() {
    # Select the appropriate command depending on input.
    case "$1" in
        1)
            sort
            ;;
        2)
            awk '{ print $5 }' | sort
            ;;
        *)
            cat  # <= Is there something more optimal here?
            ;;
    esac
}

# The data processing pipeline.
<"$1" tr '[:upper:]' '[:lower:]' | switchboard "$2" | head -n 10

In the "switchboard" function, the fallback is just to use cat to send the input directly to the output. This works just fine, but in my pipeline I may have many "switchboards" and I'd like to avoid creating a bunch of do-nothing cat processes if possible.

Is there some sort of bash built-in (or alternative) that can be used to specify that a given section of a pipeline should connect STDOUT directly to STDIN without having to use a subprocess? (I tried : but that just ate the data) Or, does cat use such a small amount of resources that this is a non-issue?

Best Answer

First, one more cat doesn't really make much difference, and you shouldn't worry about it.

Second, the commands that make up a pipeline are executed in separate processes anyway, no matter if they're external commands or built-ins:

$ a=0
$ a=1 | a=2 | a=3
$ echo $a
0
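The same holds for a shell function used as a pipeline stage, which is the situation in the question. A minimal sketch, assuming bash; the names stage and touched are made up for illustration:

```shell
#!/bin/bash
touched=no

# Hypothetical pass-through stage: sets a variable, then copies stdin to stdout.
stage() {
    touched=yes
    cat
}

# stage is a middle element of the pipeline, so it runs in its own subshell
# and its assignment is lost when that subshell exits.
printf 'data\n' | stage | cat >/dev/null

echo "$touched"   # prints "no"
```

So the function already costs one process before it even starts cat.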

As to your exact problem, it's not possible to simply connect stdin to stdout. Even if a shell had some nop builtin that collapsed when used in a pipeline (e.g. | nop | -> |), the shell has no way to know in advance, at the time it sets up the pipeline, that your "switchboard" will switch to nop instead of awk or sort.

You can also achieve the same effect as your "switchboards" by building the pipeline yourself as a string, and then calling eval to run it. Example:

$ cat test.sh
type=`file -zi "$1"`
case $type in
*application/gzip*)     mycat='zcat "$1"';;
*)                      mycat='cat "$1"';;
esac
case $type in
*charset=utf-16le*)     mycat="$mycat | iconv -f utf16le";;
esac
# highlight comments in blue
esc=`printf '\033'`;
mycat="$mycat | sed 's/^#.*/$esc[34m&$esc[m/'"
echo >&2 "$mycat"    # show the built pipeline
eval "$mycat"   # ... and run it
$ iconv -t utf16 test.sh > test16.sh; gzip test16.sh
$ sh test.sh test16.sh.gz
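Applied to the switchboard from the question, the same idea might look like the sketch below. The function names build_pipeline and run_pipeline are made up for illustration; the point is only that the fallback case adds nothing to the string, so no extra cat stage exists. Only fixed strings are spliced into the pipeline, and the mode argument is used solely to pick a case branch.

```shell
#!/bin/bash

# build_pipeline MODE: print the pipeline as a string (hypothetical helper).
build_pipeline() {
    local p='tr "[:upper:]" "[:lower:]"'
    case "$1" in
        1) p="$p | sort" ;;
        2) p="$p | awk '{ print \$5 }' | sort" ;;
        *) ;;   # pass-through: append no stage at all
    esac
    printf '%s\n' "$p | head -n 10"
}

# run_pipeline FILE MODE: run the built pipeline with FILE as its stdin.
run_pipeline() {
    eval "$(build_pipeline "$2")" <"$1"
}

# Example call, using the script's own arguments as in the question:
#   run_pipeline "$1" "$2"
```

Because the string is assembled before eval runs it, the shell sets up exactly the stages that are needed.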

This is a bit off-topic, but on Linux there is a faster way to copy stdin to stdout when at least one of them is a pipe: the splice(2) syscall, which moves the data inside the kernel without copying it to and from userland:

$ cat splice_cat.c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <err.h>

int main(int ac, char **av){
    ssize_t r;
    size_t block = ac > 1 ? strtoul(av[1], 0, 0) : 0x20000;
    for(;;)
            if((r = splice(0, NULL, 1, NULL, block, 0)) < 1){
                    if(r < 0) err(1, "splice");
                    return 0;
            }
}
$ cc -Wall splice_cat.c -o splice_cat
$ dd if=/dev/zero bs=1M count=100 status=none | (time cat >/dev/null)
real    0m0.153s
user    0m0.012s
sys     0m0.056s
$ dd if=/dev/zero bs=1M count=100 status=none | (time ./splice_cat >/dev/null)
real    0m0.100s
user    0m0.004s
sys     0m0.020s

However, as far as I know, splice(2) is not used by the shell or by cat, dd, etc.
