Replace a string containing newline characters

Tags: grep, newlines, sed, text processing

Using the bash shell, I have a file with lines like the following:

first "line"
<second>line and so on

I would like to replace one or more occurrences of "line"\n<second> with other characters and obtain each time:

first other characters line and so on

So I have to replace a string that contains both special characters such as " and < and a newline character.

After searching among the other answers, I found that sed can accept newlines in the right-hand side of the command (that is, in the other characters string), but not in the left.

Is there a way (simpler than this) to obtain this result with sed or grep?

Best Answer

Three different sed commands:

sed '$!N;s/"[^"]*"\n<[^>]*>/other characters /;P;D'

sed -e :n -e '$!N;s/"[^"]*"\n<[^>]*>/other characters /;tn'

sed -e :n -e '$!N;/"$/{$!bn' -e '};s/"[^"]*"\n<[^>]*>/other characters /g'

All three build on the basic s///ubstitution command:

s/"[^"]*"\n<[^>]*>/other characters /

They also all try to take care in their handling of the last line, as seds tend to differ on their output in edge cases. This is the meaning of $! which is an address matching every line that is !not the $last.
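As a small sketch of what the $! address selects, here it is applied to a made-up three-line input (the letters a, b and c and the variable names are placeholders, not part of the original answer):

```shell
# -n suppresses autoprint, so only lines selected by the address are printed.
not_last=$(printf '%s\n' a b c | sed -n '$!p')   # every line except the last
only_last=$(printf '%s\n' a b c | sed -n '$p')   # the last line only

printf 'matched by $!:\n%s\n' "$not_last"
printf 'matched by $:\n%s\n' "$only_last"
```

In the three commands above the same address guards N, so sed never tries to append a line that does not exist.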

They also all use the Next command to append the next input line to pattern space following a \newline character. Anyone who has been seding for a while will have learned to rely on the \newline character - because the only way to get one is to explicitly put it there.
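To illustrate how N makes the \newline available to the s///ubstitution, the following sketch runs just those two pieces on the sample from the question. The variable names input and joined are only for illustration.

```shell
# Sample input from the question: two lines that should collapse into one.
input='first "line"
<second>line and so on'

# $!N appends the second line, so pattern space holds both lines joined
# by a newline, which the \n in the regular expression can then match.
joined=$(printf '%s\n' "$input" | sed '$!N;s/"[^"]*"\n<[^>]*>/other characters /')

printf '%s\n' "$joined"
```

With this two-line input the result is the single line first other characters line and so on. Without the N, sed would only ever see one line at a time and the \n in the pattern could never match.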

All three make some attempt to read in as little input as possible before taking action - sed acts as soon as it might and needn't read in an entire input file before doing so.

Though they do all N, they all three differ in their methods of recursion.

First Command

The first command employs a very simple N;P;D loop. These three commands are built into any POSIX-compatible sed and they complement one another nicely.

  • N - as already mentioned, appends the Next input line to pattern-space following an inserted \newline delimiter.
  • P - like p; it Prints pattern-space - but only up to the first occurring \newline character. And so, given the following input/command:

      printf %s\\n one two | sed '$!N;P;d'

    sed Prints only one. However, with...
  • D - like d; it Deletes pattern-space and begins another line-cycle. Unlike d, D deletes only up to the first occurring \newline in pattern-space. If there is more in pattern-space following the \newline character, sed begins the next line-cycle with what remains. If the d in the previous example were replaced with a D, for example, sed would Print both one and two.
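The difference between d and D described in the list can be checked side by side. This is a minimal sketch; the variable names with_d and with_D are only for illustration.

```shell
# Same input, same N and P; only the final delete command differs.
with_d=$(printf '%s\n' one two | sed '$!N;P;d')   # d discards everything
with_D=$(printf '%s\n' one two | sed '$!N;P;D')   # D keeps what follows the newline

printf 'P;d gives:\n%s\n' "$with_d"
printf 'P;D gives:\n%s\n' "$with_D"
```

With d, only one is printed, because two is thrown away together with the rest of pattern-space. With D, two survives into the next cycle, where P prints it.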

This command recurses only for lines which do not match the s///ubstitution statement. Because the s///ubstitution removes the \newline added with N, there is never anything remaining when sed Deletes pattern-space.

Tests could be done to apply the P and/or D selectively, but there are other commands which fit better with that strategy. Because the recursion is implemented to handle consecutive lines which match only part of the replacement rule, consecutive sequences of lines matching both ends of the s///ubstitution do not work well.

Given this input:

first "line"
<second>"line"
<second>"line"
<second>line and so on

...it prints...

first other characters "line"
<second>other characters line and so on

It does, however, handle

first "line"
second "line"
<second>line

...just fine.
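Both cases can be reproduced with a short script. This is a sketch that assumes a POSIX-like sed such as GNU sed; the function name first_cmd and the variable names are hypothetical helpers, not part of the answer.

```shell
# Input where every line matches both ends of the pattern.
consecutive='first "line"
<second>"line"
<second>"line"
<second>line and so on'

# Input where the first line matches only the first half of the pattern.
partial='first "line"
second "line"
<second>line'

# The first command from the answer, wrapped for reuse.
first_cmd() {
    sed '$!N;s/"[^"]*"\n<[^>]*>/other characters /;P;D'
}

out_consecutive=$(printf '%s\n' "$consecutive" | first_cmd)
out_partial=$(printf '%s\n' "$partial" | first_cmd)

printf '%s\n' "$out_consecutive"
printf '%s\n' '---'
printf '%s\n' "$out_partial"
```

For the second input, the unmatched first "line" is printed as is and the remaining pair is joined into second other characters line.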

Second Command

This command is very similar to the third. Both employ a :branch/test label (as is also demonstrated in Joeseph R.'s answer here) and recurse back to it given certain conditions.

  • -e :n -e - portable sed scripts will delimit a :label definition with either a \newline or a new inline -execution statement.
    • :n - defines a label named n. This can be returned to at any time with either bn or tn.
  • tn - the test command branches to the specified label (or, if none is provided, to the end of the script) if any s///ubstitution has succeeded since the most recent input line was read or since t was last executed.
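The label and test mechanism can be seen in isolation with a toy input. In this sketch the string aaa and the variable names are made up for illustration.

```shell
# Without a loop, s/a/b/ replaces only the first a.
once=$(printf '%s\n' aaa | sed 's/a/b/')

# With :n and tn, sed jumps back to the label after every successful
# substitution and stops looping as soon as one fails.
looped=$(printf '%s\n' aaa | sed -e :n -e 's/a/b/;tn')

printf 'once:   %s\n' "$once"
printf 'looped: %s\n' "$looped"
```

The first run yields baa and the looped run yields bbb, since the loop only ends when no a is left to replace.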

In this command the recursion occurs for the matching lines. If sed successfully replaces the pattern with other characters, sed returns to the :n label and tries again. If a s///ubstitution is not performed sed autoprints pattern-space and begins the next line-cycle.

This tends to handle consecutive sequences better. Given the four-line input on which the first command fell short, this prints:

first other characters other characters other characters line and so on
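That result can be reproduced as follows. This is a sketch assuming a POSIX-like sed such as GNU sed; the variable names are only for illustration.

```shell
# The four-line input with consecutive matching lines.
consecutive='first "line"
<second>"line"
<second>"line"
<second>line and so on'

# Second command: after each successful substitution, tn jumps back to :n,
# where N appends the next line and the substitution is tried again.
second_out=$(printf '%s\n' "$consecutive" |
    sed -e :n -e '$!N;s/"[^"]*"\n<[^>]*>/other characters /;tn')

printf '%s\n' "$second_out"
```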

Third Command

As mentioned, the logic here is very similar to the last, but the test is more explicit.

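  • /"$/{$!bn ... } - this is sed's test. Because the branch command is a function of this address, sed will only branch back to :n when, after a \newline and the next line are appended, pattern-space still ends with a " double-quote. The inner $! keeps sed from branching on the last line, where there is no more input to append and the loop would otherwise never end.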

There is as little done between N and b as possible - in this way sed can very quickly gather exactly as much input as necessary to ensure that the following line cannot match your rule. The s///ubstitution differs here in that it employs the global flag - and so it will do all necessary replacements at once. Given identical input this command outputs identically to the last.
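To check that claim, the sketch below runs the second and third commands on the same input and compares the results. It assumes a POSIX-like sed such as GNU sed, and the variable names are only for illustration.

```shell
consecutive='first "line"
<second>"line"
<second>"line"
<second>line and so on'

# Second command: substitute, then loop back while substitutions succeed.
second_out=$(printf '%s\n' "$consecutive" |
    sed -e :n -e '$!N;s/"[^"]*"\n<[^>]*>/other characters /;tn')

# Third command: keep appending lines while pattern space ends in a
# double-quote, then do every replacement in one pass with the g flag.
third_out=$(printf '%s\n' "$consecutive" |
    sed -e :n -e '$!N;/"$/{$!bn' -e '};s/"[^"]*"\n<[^>]*>/other characters /g')

printf '%s\n' "$third_out"
if [ "$second_out" = "$third_out" ]; then
    printf '%s\n' 'both commands agree'
fi
```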
