Bash – way to specify a section of a pipeline be completely “pass through”

bash pipe

I have a script in which data is processed by streaming it through a fairly large pipeline. Several sections of the pipeline are actually "switchboard" functions that do different things based on some external parameter. A contrived example is given below.

#! /bin/bash

switchboard() {
    # Select the appropriate command depending on input.
    case "$1" in
        1)
            sort
            ;;
        2)
            awk '{ print $5 }' | sort
            ;;
        *)
            cat  # <= Is there something more optimal here?
            ;;
    esac
}

# The data processing pipeline.
<"$1" tr '[:upper:]' '[:lower:]' | switchboard "$2" | head -n 10

In the "switchboard" function, the fallback is just to use cat to send the input directly to the output. This works just fine, but in my pipeline I may have many "switchboards" and I'd like to avoid creating a bunch of do-nothing cat processes if possible.

Is there some sort of bash built-in (or alternative) that can be used to specify that a given section of a pipeline should connect STDOUT directly to STDIN without having to use a subprocess? (I tried : but that just ate the data) Or, does cat use such a small amount of resources that this is a non-issue?

Best Answer

First, the use of yet another cat doesn't really make much difference, and you shouldn't bother about it.

Second, the commands that make up a pipeline are executed in separate processes anyway, no matter if they're external commands or built-ins:

$ a=0
$ a=1 | a=2 | a=3
$ echo $a
0
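A practical consequence of this, as a small sketch (the variable name `seen` is made up for illustration): a `while read` loop on the receiving end of a pipe is itself a pipeline stage, so under bash's default settings (no `lastpipe`) anything it assigns is lost when the pipeline finishes.

```shell
#!/bin/bash
# Illustrative only: 'seen' is a made-up variable name.
seen=0

# The loop is the last stage of a pipeline, so bash (without the
# 'lastpipe' option) runs it in a subshell.
printf 'one\ntwo\n' | while IFS= read -r line; do
    seen=$((seen + 1))
done

# The parent shell's copy was never modified.
echo "seen=$seen"
```

With bash's defaults this reports `seen=0`, for the same reason `a` stays 0 in the example above.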

As to your exact problem, it's not possible to simply connect 'stdin' to 'stdout'; even if a shell had some nop builtin which would collapse when used in a pipeline (eg | nop | -> |), the shell has no way to know in advance, at the time it sets up the pipeline, that your "switchboard" will switch to nop instead of awk or sort.

You can also achieve the same effect as your "switchboards" by building the pipeline yourself, and then calling eval to run it. Example:

$ cat test.sh
type=`file -zi "$1"`
case $type in
*application/gzip*)     mycat='zcat "$1"';;
*)                      mycat='cat "$1"';;
esac
case $type in
*charset=utf-16le*)     mycat="$mycat | iconv -f utf16le";;
esac
# highlight comments in blue
esc=`printf '\033'`;
mycat="$mycat | sed 's/^#.*/$esc[34m&$esc[m/'"
echo >&2 "$mycat"    # show the built pipeline
eval "$mycat"   # ... and run it
$ iconv -t utf16 test.sh > test16.sh; gzip test16.sh
$ sh test.sh test16.sh.gz
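Applied to the switchboard from the question, the same idea might look like the sketch below. The helper names `stage_for` and `build_pipeline` are hypothetical, and the stages are the ones from the question's contrived example. A pass-through mode simply contributes no stage, so no extra cat is started.

```shell
#!/bin/bash
# Hypothetical helper: print the pipeline fragment for a mode,
# or nothing at all when the stage should be a pass-through.
stage_for() {
    case "$1" in
        1) printf '%s' 'sort' ;;
        2) printf '%s' "awk '{ print \$5 }' | sort" ;;
        *) : ;;
    esac
}

# Hypothetical helper: assemble the whole pipeline as a string.
build_pipeline() {
    local pipeline stage
    pipeline="tr '[:upper:]' '[:lower:]'"
    stage=$(stage_for "$1")
    if [ -n "$stage" ]; then
        pipeline="$pipeline | $stage"
    fi
    printf '%s\n' "$pipeline | head -n 10"
}

# Usage: printf 'B\nA\n' | eval "$(build_pipeline 1)"
```

Because the result is passed to eval, the fragments should only ever come from fixed strings like these, never from untrusted input.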

That's a bit off-topic, but on Linux there is a faster way to copy stdin to stdout (if either of them is a pipe): the splice(2) syscall, which doesn't involve moving the data to and from userland:

$ cat splice_cat.c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <err.h>

int main(int ac, char **av){
    ssize_t r;
    size_t block = ac > 1 ? strtoul(av[1], 0, 0) : 0x20000;
    for(;;)
            if((r = splice(0, NULL, 1, NULL, block, 0)) < 1){
                    if(r < 0) err(1, "splice");
                    return 0;
            }
}
$ cc -Wall splice_cat.c -o splice_cat
$ dd if=/dev/zero bs=1M count=100 status=none | (time cat >/dev/null)
real    0m0.153s
user    0m0.012s
sys     0m0.056s
$ dd if=/dev/zero bs=1M count=100 status=none | (time ./splice_cat >/dev/null)
real    0m0.100s
user    0m0.004s
sys     0m0.020s

However (afaik), that's not used by either the shell or cat, dd, etc.

`nl`

nl, for example, separates input into logical pages as -delimited by a two-character section delimiter. Three occurrences on a line all alone indicate the start of a heading, two the body and one the footer. It replaces any of these found in input with a blank line in output - which are the only blank lines it ever prints.

I altered your example to include another section and put it in ./infile. So it looks like this:

line A
line B
@@inline-code-start
line X
line Y
line Z
@@inline-code-end
line C
line D
@@start
line M
line N
line O
@@end

Then I ran the following:

sed 's/^@@.*start$/@@@@@@/
     s/^@@.*end$/@@/'  <infile |
nl -d@@ -ha -bn -w1

nl can be told to accumulate state across logical pages, but it does not by default. Instead it will number the lines of its input according to styles, and by section. So -ha means number all header lines and -bn means no body lines - as it starts out in a body state.

Until I learned this I used to use nl for any input, but after realizing that nl might distort output according to its default -delimiter \: I learned to be more careful with it and started using grep -nF '' for untested input instead. But another lesson learned that day was that nl can be very usefully applied in other respects - such as this one - if you just modify its input only a little - as I do with sed above.

OUTPUT

  line A
  line B

1       line X
2       line Y
3       line Z

  line C
  line D

1       line M
2       line N
3       line O
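The caution about nl's default delimiter mentioned earlier can be checked with a small sketch. It assumes an nl that implements section delimiters (GNU nl does); `sample` is a made-up helper that emits three lines, the second of which happens to be the default body delimiter.

```shell
#!/bin/sh
# Made-up sample input whose second line is nl's default
# body-section delimiter (\:\:).
sample() { printf 'one\n\\:\\:\ntwo\n'; }

# nl treats the delimiter line as markup and swallows it ...
sample | nl -ba

# ... while grep -nF '' numbers every line and keeps it verbatim.
sample | grep -nF ''
```

The grep form matches the empty string on every line, so nothing in the input can be mistaken for markup.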

Here's some more about nl - do you notice above how all lines but the numbered ones start with spaces? When nl numbers lines it inserts a certain number of characters into the head of each. For those lines it doesn't number - even blanks - it always matches the indent by inserting ( -width count + -separator len ) * spaces at the head of unnumbered lines. This allows you to reproduce the not-numbered content exactly by comparing it to the numbered content - and with little effort. When you consider that nl will divide its input into logical sections for you, and that you can insert arbitrary -strings at the head of each line it numbers, then it gets pretty easy to handle its output:

sed 's/^@@.*start$/@@@@@@/
     s/^@@.*end/@@/; t
     s/^\(@@\)\{1,3\}$/& /' <infile |
nl -d@@ -ha -bn -s' do something with the next line!
'

The above prints...

                                        line A
                                        line B

 1 do something with the next line!
line X
 2 do something with the next line!
line Y
 3 do something with the next line!
line Z

                                        line C
                                        line D

 1 do something with the next line!
line M
 2 do something with the next line!
line N
 3 do something with the next line!
line O
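The indent rule for unnumbered lines can be seen in isolation with a sketch like this one. It assumes GNU nl behavior; `pad_len` is an illustrative variable, not something nl provides.

```shell
#!/bin/sh
# With numbering switched off for body lines (-bn), nl pads each line
# with (width + separator length) spaces: here 3 + 2 = 5.
printf 'x\n' | nl -bn -w3 -s'::'

# The padding can be stripped again to recover the original text.
pad_len=5
printf 'x\n' | nl -bn -w3 -s'::' | cut -c"$((pad_len + 1))"-
```

Since the pad width is known from the options given, cutting it off again is a fixed-column operation.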

GNU `sed`

If nl is not your target application, then a GNU sed can execute an arbitrary shell command for you depending on a match.

sed '/^@@.*start$/!b
     s//nl <<\\@@/;:l;N
     s/\(\n@@\)[^\n]*end$/\1/
Tl;e'  <infile

Above, sed collects input in pattern space until it has enough to successfully pass the substitution Test and stop branching back to the :label. When it does, it executes nl with input represented as a <<here-document for all of the rest of its pattern-space.

The workflow is like this:

/^@@.*start$/!b
- if an ^entire line$ does !not /match/ the above pattern, then it is branched out of the script and autoprinted - so from this point on we are only working with a series of lines which began with the pattern.
s//nl <<\\@@/
- the empty s//field/ stands in for the last address sed attempted to match - so this command substitutes the entire @@.*start line for nl <<\\@@ instead.
:l;N
- The : command defines a branch label - here I set one named :label. The Next command appends the next line of input to pattern space followed by a \newline character. This is one of only a few ways to get a \newline in a sed pattern space - the \newline character is a sure delimiter to a sedder who has been doing it awhile.
s/\(\n@@\)[^\n]*end$/\1/
- this s///ubstitution can only be successful after a start is encountered and only on the first following occurrence of an end line. It will only act on a pattern space in which the final \newline is immediately followed by @@.*end marking the very end$ of pattern space. When it does act, it replaces the whole matched string with the \1first \(group\), or \n@@.
Tl
- the Test command branches to a label (if provided) if a successful substitution has not occurred since the last time an input line was pulled into pattern space (as I do w/ N). This means that each time a \newline is appended to pattern space which does not match your end delimiter, the Test command fails and branches back to the :label, which results in sed pulling in the Next line and looping until successful.
e
- When the substitution for the end match is successful and the script does not branch back for a failed Test, sed will execute a command that looks like this:
```
nl <<\\@@\nline X\nline Y\nline Z\n@@$
```

You can see this for yourself by editing the last line there to look like Tl;l;e.

It prints:

line A
line B
     1  line X
     2  line Y
     3  line Z
line C
line D
     1  line M
     2  line N
     3  line O
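The accumulation loop from the workflow can also be watched on its own, without running anything. This is a sketch that assumes GNU sed (for the T command and for \n inside a bracket expression); `show_pattern_space` is a made-up name, and the e command is replaced with l so the collected pattern space is only dumped.

```shell
#!/bin/sh
# Made-up helper: collect a start..end block and dump the pattern
# space with l instead of executing it. -n suppresses autoprint,
# so lines outside a block are silently dropped here.
show_pattern_space() {
    sed -n '/^@@.*start$/!b
:l
N
s/\(\n@@\)[^\n]*end$/\1/
Tl
l'
}

printf '@@start\nx\ny\n@@end\n' | show_pattern_space
```

The dump shows the start line, the collected lines and the shortened end marker joined by literal \n sequences, with a trailing $ marking the end of pattern space.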

`while ... read`

One last way to do this, and maybe the simplest way, is to use a while read loop, but only for good reason. The shell - (most especially a bash shell) - is typically pretty abysmal at handling input in large amounts or in steady streams. This makes sense, too - the shell's job is to handle input character by character and to call up other commands which can handle the bigger stuff.

But an important part of that role is that the shell must not read too much of the input - it is specified not to buffer input or output to the point that it consumes so much, or relays so little in time, that the commands it calls are left lacking - to the byte. So read makes for an excellent input test - to return information about whether there is input remaining and you should call up the next command to read it - but it is otherwise generally not the best way to go.

Here's an example, however, of how one might use read and other commands to process input in sync:

while   IFS= read -r line        &&
case    $line in (@@*start) :;;  (*)
        printf %s\\n "$line"
        sed -un "/^@@.*start$/q;p";;
esac;do sed -un "/^@@.*end$/q;=;p" |
        paste -d: - -
done    <infile

The first thing that happens on each iteration is that read pulls in a line. If it succeeds, the loop has not yet hit EOF. If the line matches a start delimiter, the do block is executed immediately; otherwise, printf prints the $line it read and sed is called.

sed will print every line until it encounters the start marker - when it quits input entirely. The -unbuffered switch is necessary for GNU sed because it can buffer rather greedily otherwise, but - according to spec - other POSIX seds should work without any special consideration - so long as <infile is a regular file.

When the first sed quits, the shell executes the do block of the loop - which calls another sed that prints every line until it encounters the end marker. It pipes its output to paste, because it prints line numbers each on their own line. Like this:

1
line M
2
line N
3
line O

paste then pastes those together on : characters, and the whole output looks like:

line A
line B
1:line X
2:line Y
3:line Z
line C
line D
1:line M
2:line N
3:line O

These are just examples - anything could be done in either the test or do blocks here, but the first utility must not consume too much input.

All of the utilities involved read the same input - and print their results - each in their own turn. This kind of thing can be difficult to get the hang of - because different utilities will buffer more than others - but you can generally rely on dd, head, and sed to do the right thing (though, for GNU sed, you need the cli-switch) and you should always be able to rely on read - because it is, by nature, very slow. And that's why the above loop calls it only the one time per input block.
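The hand-off between read and a following utility can be reduced to a few lines. This is a sketch with a made-up helper name, `split_first`; it relies on read consuming no more than one line from a pipe, so the next command sees everything after it.

```shell
#!/bin/sh
# Made-up helper: read consumes exactly one line, then sed picks
# up the very next byte from the same standard input.
split_first() {
    IFS= read -r first && printf 'first=%s\n' "$first"
    sed 's/^/rest=/'
}

printf 'a\nb\nc\n' | split_first
```

Here sed is the last reader, so it may buffer as greedily as it likes. Only a utility followed by another reader has to stop exactly on a line boundary.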
