How to remove the newline character (\n) between two strings in Unix

Tags: html, newlines, sed, text-processing

I want to remove the newline between two HTML tags that appear as follows:

<font>
</font>

I want to remove the newline character such that it becomes:

<font></font>

There may also be cases with more than one newline between the tags:

<font>

</font>

Those should also be removed, so that it becomes:

<font></font>

One more scenario, where the tags enclose text:

<font>
This is a text
</font>

After conversion, it should become:

<font>This is a text</font>

All of the above scenarios are resolved if we remove only the newlines between the two HTML tags. Other whitespace does not need to be considered.

I have found a couple of ways to do this with sed, but they are very time consuming and very inefficient performance-wise, particularly if the file has 1000+ HTML tags.

Best Answer

This sed command should help you:

sed -e ':1;/<font>[[:space:]]*$/{N;s#<font>[[:space:]]\+</font>#<font></font>#g;b1}' file

The command looks for a <font> tag followed only by whitespace up to the end of the pattern space. When it finds one, N appends the next input line to the pattern space, the substitution collapses any sequence <font>[[:space:]]\+</font> into <font></font>, and b1 jumps back to the label :1 to test again. If the pattern space does not match the address /<font>[[:space:]]*$/, i.e. some non-space content follows the <font> tag, the pattern space is printed and cleared when the script ends, and processing restarts with the next line.
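
As a quick check, the command can be run on a small sample. This is a minimal sketch that assumes GNU sed (the \+ operator and the ; after the label are GNU extensions); the sample input and the variable name result are illustrative only.

```shell
# Feed the sed script three sample blocks (assumes GNU sed).
# `result` is a hypothetical variable name used only for this demonstration.
result=$(printf '%s\n' '<font>' '</font>' '<font>' '' '</font>' '<font>' 'This is a text' '</font>' |
  sed -e ':1;/<font>[[:space:]]*$/{N;s#<font>[[:space:]]\+</font>#<font></font>#g;b1}')
printf '%s\n' "$result"
```

The two whitespace-only blocks should collapse to <font></font>, while the block containing text passes through unchanged, because the substitution only matches when nothing but whitespace separates the two tags.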

Edit: Performance measurement.

I filled a file with the following content repeated 10k times:

<font>
dejidewji
</font>
<font>



</font><font>





</font>

620 KB in total. The timings of the script above on a 1.4 GHz A8-4500M are:

real    0m0.361s
user    0m0.356s
sys     0m0.005s
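
To reproduce the measurement, the test file can be regenerated with a short loop. This is a sketch, not necessarily the commands used for the numbers above; the variable name sample is hypothetical, and mktemp is assumed to be available.

```shell
# Write one copy of the sample content (62 bytes) 10,000 times to a temp file.
# `sample` is a hypothetical variable name; mktemp is assumed to be available.
sample=$(mktemp)
i=0
while [ "$i" -lt 10000 ]; do
  printf '<font>\ndejidewji\n</font>\n<font>\n\n\n\n</font><font>\n\n\n\n\n\n</font>\n'
  i=$((i + 1))
done > "$sample"

# Report the size in bytes.
wc -c < "$sample"
```

The file comes to 620,000 bytes, consistent with the size quoted above. In bash, prefixing the sed command with the time keyword and pointing it at "$sample" gives real/user/sys figures of the kind shown.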

Edit2:

Your latest question update is much more easily solved with perl, and performance is 10 times better, as the other answer showed:

perl -0777 -pe 's|<font>\s+|<font>|g;s|\s+</font>|</font>|g' file

Credits to @spasic
