Shell – Filter or pipe certain sections of a file

fifo, pipe, shell-script, text-processing

I have an input file with some sections that are demarcated with start and end tags, for example:

line A
line B
@@inline-code-start
line X
line Y
line Z
@@inline-code-end
line C
line D

I want to apply a transformation to this file such that lines X,Y,Z are filtered through some command (nl, for example), but the rest of the lines pass through unchanged. Notice that nl (number lines) accumulates state across lines, so it is not a static transformation that is being applied to each of lines X,Y,Z. (Edit: it was pointed out that nl can work in a mode that doesn't require accumulated state, but I am just using nl as an example to simplify the question. In reality the command is a more complex custom script. What I am really looking for is a generic solution to the problem of applying a standard filter to a subsection of an input file)

The output should look like:

line A
line B
     1 line X
     2 line Y
     3 line Z
line C
line D

There can be several such sections in the file that require the transformation.

Update 2: I didn't originally specify what should happen if there is more than one section, for example:

line A
line B
@@inline-code-start
line X
line Y
line Z
@@inline-code-end
line C
line D
@@inline-code-start
line L
line M
line N
@@inline-code-end

My expectation would be that state would only need to be maintained within a given section, giving:

line A
line B
     1 line X
     2 line Y
     3 line Z
line C
line D
     1 line L
     2 line M
     3 line N

but, I think interpreting the problem as requiring the state to be kept across sections is valid, and useful in many contexts.

End Update 2

My first thought is to build a simple state machine that tracks what section we are in:

#!/usr/bin/bash
while read line
do
  if [[ $line == @@inline-code-start* ]]
  then
    active=true
  elif [[ $line == @@inline-code-end* ]]
  then
    active=false
  elif [[ $active = true ]]
  then
    # pipe
    echo $line | nl
  else
    # output
    echo $line
  fi
done

Which I run with:

cat test-inline-codify | ./inline-codify

This doesn't work since each call to nl is independent, so the line numbers do not increase:

line A
line B
     1  line X
     1  line Y
     1  line Z
line C
line D

My next attempt was to use a fifo:

#!/usr/bin/bash
mkfifo myfifo
nl < myfifo &
while read line
do
  if [[ $line == @@inline-code-start* ]]
  then
    active=true
  elif [[ $line == @@inline-code-end* ]]
  then
    active=false
  elif [[ $active = true ]]
  then
    # pipe
    echo $line > myfifo
  else
    # output
    echo $line
  fi
done
rm myfifo

This gives the correct output, but in the wrong order:

line A
line B
line C
line D
     1  line 1
     2  line 2
     3  line 3

There's probably some caching going on.

Am I going about this all wrong? This seems like a pretty generic problem. I feel like there ought to be a simple pipeline that would solve this.

Best Answer

I would agree with you - it probably is a generic problem. Some common utilities have some facilities for handling it, though.


nl

nl, for example, separates input into logical pages, delimited (-d) by a two-character section delimiter. Three occurrences of the delimiter alone on a line indicate the start of a header, two the body, and one the footer. nl replaces any of these found in input with a blank line in output, and those are the only blank lines it ever prints.

I altered your example to include another section and put it in ./infile. So it looks like this:

line A
line B
@@inline-code-start
line X
line Y
line Z
@@inline-code-end
line C
line D
@@start
line M
line N
line O
@@end

Then I ran the following:

sed 's/^@@.*start$/@@@@@@/
     s/^@@.*end$/@@/'  <infile |
nl -d@@ -ha -bn -w1

nl can be told to accumulate state across logical pages (with -p), but it does not by default. Instead it numbers the lines of its input according to styles, and by section. So -ha means number all header lines and -bn means number no body lines, which matters because nl starts out in the body state.

Until I learned this I used to use nl for any input, but after realizing that nl might distort output according to its default delimiter (-d), \:, I learned to be more careful with it and started using grep -nF '' for untested input instead. Another lesson learned that day was that nl can be applied very usefully in other respects, such as this one, if you modify its input just a little, as I do with sed above.

OUTPUT

  line A
  line B

1       line X
2       line Y
3       line Z

  line C
  line D

1       line M
2       line N
3       line O
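
For the other interpretation mentioned in the question, where numbering continues across sections, nl's -p option (do not restart numbering at logical pages) should do it. A minimal sketch, assuming a GNU or POSIX nl; the function name is made up for illustration:

```shell
# number_sections_continuously: hypothetical helper, reads tagged text on stdin.
# Same sed | nl pipeline as above, plus -p so the counter is not reset
# at the start of each logical page.
number_sections_continuously() {
    sed 's/^@@.*start$/@@@@@@/
         s/^@@.*end$/@@/' |
    nl -d@@ -ha -bn -w1 -p
}

# Usage: number_sections_continuously <infile
```

With the ./infile above, line M through line O should then be numbered 4 to 6 rather than 1 to 3.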

Here's some more about nl: do you notice above how all lines but the numbered ones start with spaces? When nl numbers lines it inserts a certain number of characters at the head of each. For the lines it doesn't number, even blank ones, it always matches the indent by inserting (width + separator length) spaces at the head, where width is the -w count and the separator is the -s string. This allows you to reproduce the unnumbered content exactly by comparing it to the numbered content, and with little effort. When you consider that nl will divide its input into logical sections for you, and that you can insert an arbitrary string (-s) at the head of each line it numbers, then it gets pretty easy to handle its output:

sed 's/^@@.*start$/@@@@@@/
     s/^@@.*end/@@/; t
     s/^\(@@\)\{1,3\}$/& /' <infile |
nl -d@@ -ha -bn -s' do something with the next line!
'

The above prints...

                                        line A
                                        line B

 1 do something with the next line!
line X
 2 do something with the next line!
line Y
 3 do something with the next line!
line Z

                                        line C
                                        line D

 1 do something with the next line!
line M
 2 do something with the next line!
line N
 3 do something with the next line!
line O
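
To illustrate the earlier caution about untested input: a line that happens to consist only of nl's default delimiter sequence is treated by nl as a section marker, whereas grep -nF '' numbers every line literally. A small sketch; the function name is only for illustration:

```shell
# number_every_line: hypothetical helper; prefixes each stdin line with "N:".
# The empty fixed-string pattern matches every line, and grep knows nothing
# about nl's section delimiters, so nothing is interpreted or dropped.
number_every_line() {
    grep -nF ''
}

# A line containing only \:\:\: (nl's default header delimiter) passes through:
# printf '%s\n' 'first' '\:\:\:' 'last' | number_every_line
```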

GNU sed

If nl is not your target application, then a GNU sed can execute an arbitrary shell command for you depending on a match.

sed '/^@@.*start$/!b
     s//nl <<\\@@/;:l;N
     s/\(\n@@\)[^\n]*end$/\1/
Tl;e'  <infile

Above, sed collects input in pattern space until it has enough to pass the substitution test (T) and stop branching back to the :l label. When it does, it executes nl with the rest of its pattern space supplied as input in a here-document (<<).

The workflow is like this:

  1. /^@@.*start$/!b
    • If a line does not (!) match the pattern above, sed branches (b) out of the script and autoprints it, so from this point on we are only working with a series of lines that began with the start pattern.
  2. s//nl <<\\@@/
    • The empty s// pattern stands in for the last regular expression sed attempted to match, so this command replaces the entire @@.*start line with nl <<\@@ (the doubled backslash in the script yields a single one).
  3. :l;N
    • The : command defines a branch label; here it is named l. The N (Next) command appends a newline to pattern space, followed by the next line of input. This is one of only a few ways to get a newline into a sed pattern space, which makes the newline a reliable delimiter between the collected lines.
  4. s/\(\n@@\)[^\n]*end$/\1/
    • This substitution can only succeed after a start line has been seen, and only on the first end line that follows. It acts only on a pattern space in which the final newline is immediately followed by @@.*end at the very end ($) of pattern space. When it does act, it replaces the whole matched string with the first group (\1), that is, \n@@.
  5. Tl
    • The T (Test) command branches to a label (if provided) when no successful substitution has occurred since the last time an input line was pulled into pattern space (as done here with N). So each time the newly appended line is not the end delimiter, the substitution fails and T branches back to the l label, where sed pulls in the next line, looping until the substitution succeeds.
  6. e
    • When the substitution for the end match succeeds and T does not branch back, sed executes the pattern space as a shell command, which looks like this:

      nl <<\\@@\nline X\nline Y\nline Z\n@@$

You can see this for yourself by editing the last line of the script to read Tl;l;e, which makes sed print the pattern space in the escaped form shown above before executing it.

The script prints:

line A
line B
     1  line X
     2  line Y
     3  line Z
line C
line D
     1  line M
     2  line N
     3  line O
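
Since the question asks for a generic filter rather than nl specifically, the same script could be wrapped so that the command is a parameter. This is only a sketch under a few assumptions: GNU sed (for T and e), a section body that never contains a line consisting solely of @@ (it would end the here-document early), and the hypothetical names filter_sections and FILTER_CMD, which are not part of the original script. The command is passed through the environment and run with eval, so it needs no escaping inside the sed replacement:

```shell
# filter_sections CMD: hypothetical wrapper around the GNU sed script above.
# CMD is a shell command line; each @@...start / @@...end block is fed to it
# as a here-document, and everything else passes through unchanged.
filter_sections() {
    FILTER_CMD=$1 sed '/^@@.*start$/!b
        s//eval "$FILTER_CMD" <<\\@@/;:l;N
        s/\(\n@@\)[^\n]*end$/\1/
        Tl;e'
}

# Usage: filter_sections 'nl' <infile
#        filter_sections 'tr a-z A-Z' <infile
```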

while ... read

One last way to do this, and maybe the simplest, is to use a while read loop, but only with good reason. The shell (most especially bash) is typically pretty abysmal at handling input in large amounts or in steady streams. This makes sense, too: the shell's job is to handle input character by character and to call up other commands which can handle the bigger stuff.

What matters about that role, though, is that the shell must not read too much of the input. It is specified not to buffer input or output to the point that it consumes so much, or relays so little in time, that the commands it calls are left lacking, down to the byte. So read makes for an excellent input test, returning information about whether there is input remaining and whether you should call up the next command to read it, but it is otherwise generally not the best way to go.
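
The point about read not over-consuming can be seen in a small sketch: read takes exactly one line from the shared input, and the next command picks up where it left off. The function name is made up for illustration:

```shell
# first_then_rest: hypothetical demo; read takes one line from stdin,
# then cat receives everything that read left behind.
first_then_rest() {
    IFS= read -r first || return
    printf 'read got: %s\n' "$first"
    printf 'cat got:\n'
    cat
}

# Usage: printf '%s\n' 'line A' 'line B' 'line C' | first_then_rest
```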

Here's an example, however, of how one might use read and other commands to process input in sync:

while   IFS= read -r line        &&
case    $line in (@@*start) :;;  (*)
        printf %s\\n "$line"
        sed -un "/^@@.*start$/q;p";;
esac;do sed -un "/^@@.*end$/q;=;p" |
        paste -d: - -
done    <infile

The first thing that happens in each iteration is that read pulls in a line. If it succeeds, the loop has not yet hit EOF. If the line matches a start delimiter, the case does nothing and the do block is executed immediately. Otherwise, printf prints the $line it read and sed is called.

sed will print every line until it encounters the start marker, at which point it quits reading input entirely. The -u (unbuffered) switch is necessary for GNU sed because it can buffer rather greedily otherwise but, according to spec, other POSIX seds should work without any special consideration, so long as <infile is a regular file.

When the first sed quits, the shell executes the do block of the loop - which calls another sed that prints every line until it encounters the end marker. It pipes its output to paste, because it prints line numbers each on their own line. Like this:

1
line M
2
line N
3
line O

paste then pastes those together on : characters, and the whole output looks like:

line A
line B
1:line X
2:line Y
3:line Z
line C
line D
1:line M
2:line N
3:line O

These are just examples - anything could be done in either the test or do blocks here, but the first utility must not consume too much input.

All of the utilities involved read the same input, and print their results, each in their own turn. This kind of thing can be difficult to get the hang of, because some utilities buffer more than others, but you can generally rely on dd, head, and sed to do the right thing (though GNU sed needs the -u switch) and you should always be able to rely on read, because it is, by nature, very slow. And that's why the above loop calls it only once per input block.
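
Along the same lines, the loop could take the filter as its arguments, so that any command sees just the lines of one section at a time. A rough sketch, assuming GNU sed for -u and input on a regular file as discussed above; filter_sections_loop is a hypothetical name, and the = numbering and paste from the example are dropped so that the filter gets the raw lines:

```shell
# filter_sections_loop CMD [ARGS...]: hypothetical generalization of the loop.
# Reads tagged text on stdin; lines between start and end markers go to CMD,
# one invocation per section, and everything else is printed unchanged.
filter_sections_loop() {
    while IFS= read -r line &&
          case $line in
            (@@*start) : ;;                    # section begins right here
            (*) printf '%s\n' "$line"          # ordinary line: print it,
                sed -un '/^@@.*start$/q;p' ;;  # then pass through up to the next start
          esac
    do
        # Hand one section, up to its end marker, to the filter.
        sed -un '/^@@.*end$/q;p' | "$@"
    done
}

# Usage: filter_sections_loop nl <infile
```

If the input ends outside a section, the filter is still invoked one more time with empty input, which is harmless for nl but may matter for other commands.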
