I have an input file with some sections that are demarcated with start and end tags, for example:
line A
line B
@@inline-code-start
line X
line Y
line Z
@@inline-code-end
line C
line D
I want to apply a transformation to this file such that lines X, Y, Z are filtered through some command (nl, for example), but the rest of the lines pass through unchanged. Notice that nl (number lines) accumulates state across lines, so it is not a static transformation that is being applied to each of lines X, Y, Z. (Edit: it was pointed out that nl can work in a mode that doesn't require accumulated state, but I am just using nl as an example to simplify the question. In reality the command is a more complex custom script. What I am really looking for is a generic solution to the problem of applying a standard filter to a subsection of an input file.)
The output should look like:
line A
line B
1 line X
2 line Y
3 line Z
line C
line D
There can be several such sections in the file that require the transformation.
Update 2: I didn't originally specify what should happen if there is more than one section, for example:
line A
line B
@@inline-code-start
line X
line Y
line Z
@@inline-code-end
line C
line D
@@inline-code-start
line L
line M
line N
@@inline-code-end
My expectation would be that state would only need to be maintained within a given section, giving:
line A
line B
1 line X
2 line Y
3 line Z
line C
line D
1 line L
2 line M
3 line N
but, I think interpreting the problem as requiring the state to be kept across sections is valid, and useful in many contexts.
End Update 2
My first thought is to build a simple state machine that tracks what section we are in:
#!/usr/bin/bash
active=false
while IFS= read -r line
do
    if [[ $line == @@inline-code-start* ]]
    then
        active=true
    elif [[ $line == @@inline-code-end* ]]
    then
        active=false
    elif [[ $active = true ]]
    then
        # pipe
        echo "$line" | nl
    else
        # output
        echo "$line"
    fi
done
Which I run with:
cat test-inline-codify | ./inline-codify
This doesn't work since each call to nl is independent, so the line numbers do not increase:
line A
line B
1 line X
1 line Y
1 line Z
line C
line D
My next attempt was to use a fifo:
#!/usr/bin/bash
mkfifo myfifo
nl < myfifo &
active=false
while IFS= read -r line
do
    if [[ $line == @@inline-code-start* ]]
    then
        active=true
    elif [[ $line == @@inline-code-end* ]]
    then
        active=false
    elif [[ $active = true ]]
    then
        # pipe
        echo "$line" > myfifo
    else
        # output
        echo "$line"
    fi
done
rm myfifo
This gives the correct output, but in the wrong order:
line A
line B
line C
line D
1 line 1
2 line 2
3 line 3
There's probably some caching going on.
Am I going about this all wrong? This seems like a pretty generic problem. I feel like there ought to be a simple pipeline that would solve this.
Best Answer
I would agree with you - it probably is a generic problem. Some common utilities have some facilities for handling it, though.
nl
nl, for example, separates input into logical pages as delimited (-d) by a two-character section delimiter. Three occurrences on a line all alone indicate the start of a header, two the body and one the footer. It replaces any of these found in input with a blank line in output - which are the only blank lines it ever prints.
I altered your example to include another section and put it in ./infile, then ran it through sed and on into nl.
nl can be told to accumulate state across logical pages, but it does not by default. Instead it will number the lines of its input according to styles, and by section. So -ha means number all header lines and -bn means no body lines - as it starts out in a body state.
Until I learned this I used to use nl for any input, but after realizing that nl might distort output according to its default delimiter (-d) of \: I learned to be more careful with it and started using grep -nF '' for untested input instead. But another lesson learned that day was that nl can be very usefully applied in other respects - such as this one - if you just modify its input only a little - as I do with sed above.
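The sed and nl steps described above can be sketched roughly as follows. This is a reconstruction under assumptions rather than the exact command: the function name number_sections is made up for illustration, and the markers are assumed to be the @@inline-code-start and @@inline-code-end lines from the question.

```shell
#!/bin/sh
# Hypothetical helper: number only the lines between the start and end markers.
number_sections() {
    # Turn each start marker into nl's header delimiter (\:\:\:) and each
    # end marker into its body delimiter (\:\:), then let nl number all
    # header lines (-ha) and no body lines (-bn).
    sed 's/^@@.*start$/\\:\\:\\:/
         s/^@@.*end$/\\:\\:/' |
    nl -d'\:' -ha -bn
}

# Usage: number_sections <infile
```

Numbering restarts with each section because nl resets its counter at every logical page. Adding -p should keep the count running across sections instead.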
Here's some more about nl - do you notice above how all lines but the numbered ones start with spaces? When nl numbers lines it inserts a certain number of characters into the head of each. For those lines it doesn't number - even blanks - it always matches the indent by inserting (width (-w) count + separator (-s) length) spaces at the head of unnumbered lines. This allows you to reproduce the not-numbered content exactly by comparing it to the numbered content - and with little effort. When you consider that nl will divide its input into logical sections for you, and that you can insert arbitrary strings (-s) at the head of each line it numbers, then it gets pretty easy to handle its output.
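As a rough illustration of that last point, the sketch below narrows the number field to one column with -w1, sets a single space as the separator with -s, and then strips the resulting two-space indent and the blank delimiter lines. The function name number_sections_clean and the trailing sed cleanup are illustrative assumptions, not the original command.

```shell
#!/bin/sh
# Hypothetical helper: number marked sections, leave other lines untouched.
number_sections_clean() {
    sed 's/^@@.*start$/\\:\\:\\:/
         s/^@@.*end$/\\:\\:/' |
    # -w1 -s' ' makes each numbered line start with "N " and each
    # unnumbered line start with exactly two spaces (width 1 + separator 1).
    nl -d'\:' -ha -bn -w1 -s' ' |
    # Drop the empty lines nl prints in place of delimiters, then remove
    # the two-space indent from the unnumbered lines.
    sed '/^$/d
         s/^  //'
}
```

Lines that were empty in the input come out of nl as indent-only lines, so they survive the cleanup as empty lines again.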
GNU sed
If nl is not your target application, then a GNU sed can execute (e) an arbitrary shell command for you depending on a match.
The script I used has sed collect input in pattern space until it has enough to successfully pass the substitution test (T) and stop branching (b) back to the :l label. When it does, it executes (e) nl with input represented as a << here-document for all of the rest of its pattern space.
The workflow is like this:
/^@@.*start$/!b
If the entire line (^...$) does not (!) match (/.../) the above pattern, then it is branched (b) out of the script and autoprinted - so from this point on we are only working with a series of lines which began with the pattern.
s//nl <<\\@@/
The empty s// field stands in for the last address sed attempted to match - so this command replaces the entire @@.*start line with nl <<\@@ instead.
:l;N
The : command defines a branch label - here I set one named :l. The N (next) command appends a \n newline character followed by the next line of input to pattern space. This is one of only a few ways to get a \n newline in a sed pattern space - the \n newline character is a sure delimiter to a sedder who has been doing it awhile.
s/\(\n@@\)[^\n]*end$/\1/
This s/// substitution can only be successful after a start is encountered and only on the first following occurrence of an end line. It will only act on a pattern space in which the final \n newline is immediately followed by @@.*end marking the very end ($) of pattern space. When it does act, it replaces the whole matched string with the \1 first \( group \), or \n@@.
Tl
The T (test) command branches to a label (if provided) if a successful substitution has not occurred since the last time an input line was pulled into pattern space (as I do w/ N). This means that each time a \n newline is appended to pattern space which does not match your end delimiter, the T test fails and branches back to the :l label, which results in sed pulling in the N (next) line and looping until successful.
e
When the substitution for the end match is successful and the script does not branch back for a failed T test, sed will execute (e) the pattern space as a command: the nl <<\@@ line, then the collected lines, then a closing @@. You can see this for yourself by editing the last line of the script to read Tl;l;e, which has sed look (l) at the pattern space and print it before executing it.
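Putting the steps listed above together gives roughly the following script. It is a sketch assembled from that walk-through, and it needs GNU sed because the T and e commands are GNU extensions. The function name filter_sections is an illustrative assumption.

```shell
#!/bin/sh
# Hypothetical wrapper around the GNU sed script walked through above.
filter_sections() {
    # !b   : lines outside a section are printed unchanged
    # s//  : the start marker becomes the command line "nl <<\@@"
    # :l N : append input lines until the end marker shows up
    # s/// : the end marker becomes the here-document terminator "@@"
    # Tl   : loop back to :l while that substitution has not succeeded
    # e    : run the pattern space as a shell command, print its output
    sed '/^@@.*start$/!b
         s//nl <<\\@@/
         :l
         N
         s/\(\n@@\)[^\n]*end$/\1/
         Tl
         e'
}

# Usage: filter_sections <infile
```

To run a different filter, replace nl in the substitution. Note that a section line consisting of exactly @@ would end the here-document early.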
while ... read
One last way to do this, and maybe the most simple way, is to use a while read loop, but only for good reason. The shell - (most especially a bash shell) - is typically pretty abysmal at handling input in large amounts or in steady streams. This makes sense, too - the shell's job is to handle input character by character and to call up other commands which can handle the bigger stuff.
But importantly about its role there is that the shell must not read overmuch of the input - it is specified to not buffer input or output to the point that it consumes so much or doesn't relay enough in time that the commands it calls are left lacking - to the byte. So read makes for an excellent input test - to return information about whether there is input remaining and you should call up the next command to read it - but it is otherwise generally not the best way to go.
It is possible, however, to use read and other commands to process input in sync. In such a loop, the first thing that happens for each iteration is read pulls in a line. If it is successful it means the loop has not yet hit EOF and so in the case it matches a start delimiter the do block is immediately executed. Else, printf prints the $line it read and sed is called.
sed will print (p) every line until it encounters the start marker - when it quits (q) input entirely. The -u (unbuffered) switch is necessary for GNU sed because it can buffer rather greedily otherwise, but - according to spec - other POSIX seds should work without any special consideration - so long as <infile is a regular file.
When the first sed quits, the shell executes the do block of the loop - which calls another sed that prints every line until it encounters the end marker. It pipes its output to paste, because it prints line numbers each on their own line, ahead of the line they number. paste then pastes those together on : characters.
These are just examples - anything could be done in either the test or do blocks here, but the first utility must not consume too much input.
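A loop along the lines described above might look like the following sketch. It assumes GNU sed (for -u) and a regular, seekable input file. The function name number_in_sync and its file argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: $1 must be a regular file so that read and both sed
# calls can share one file offset without reading past their share.
number_in_sync() {
    while IFS= read -r line &&
          case $line in
              (@@*start) : ;;    # a section begins: fall through to do
              (*) printf '%s\n' "$line"
                  # print lines up to the next start marker, then quit
                  sed -un '/^@@.*start$/q;p' ;;
          esac
    do
        # print each line number and line on separate lines up to the
        # end marker; paste joins every such pair with a colon
        sed -un '/^@@.*end$/q;=;p' | paste -d: - -
    done <"$1"
}

# Usage: number_in_sync infile
```

Each section is numbered from 1 because every do block starts a fresh sed, and lines inside sections come out in the form 1:line X.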
All of the utilities involved read the same input - and print their results - each in their own turn. This kind of thing can be difficult to get the hang of - because different utilities will buffer more than others - but you can generally rely on dd, head, and sed to do the right thing (though, for GNU sed, you need the -u command-line switch) and you should always be able to rely on read - because it is, by nature, very slow. And that's why the above loop calls it only the one time per input block.