Sed/awk subscript numbers in chemical formulas using markup tags

I have hundreds of text files that contain chemical formulas mixed in with narrative text that includes numerical values. The formulas are always preceded by whitespace but can be followed by whitespace, commas, periods, etc.

The problem is that the formulas are not formatted to display their numbers as subscripts, e.g.:

H2SO4, C5H11OH.

I want to format the subscripts as HTML tags, e.g.:

H<sub>2</sub>SO<sub>4</sub>, C<sub>5</sub>H<sub>11</sub>OH

So that subscripts render in HTML, e.g.:

H₂SO₄, C₅H₁₁OH

I have toyed with accomplishing this in Java, PHP, etc., but the implementations are necessarily messy and awkward. I suspect that there is an elegant sed/awk approach.

Clearly, part of the solution is to craft a regular expression that matches a letter followed by one or more digits as a formula detection mechanism (there may be false positives that I will manually correct later). Then, for each formula so identified, a sed replacement needs to wrap each run of digits in an opening <sub> tag and a closing </sub> tag.

There must be a one-liner that does this, but I'm in over my head.

Any ideas?
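
The approach described in the question (match a letter followed by one or more digits, then wrap the digits in sub tags) can be sketched as a single sed substitution. This is a minimal sketch, not a definitive solution: the function name `subscript_formulas` is made up for illustration, the pattern assumes plain ASCII letters and digits, and it will tag false positives such as "Figure3" exactly as the question anticipates.

```shell
# Hypothetical helper name; reads text on stdin, writes tagged text to stdout.
subscript_formulas() {
  # \([A-Za-z]\)      group 1: the letter immediately before the number
  # \([0-9][0-9]*\)   group 2: one or more digits (portable BRE form of [0-9]+)
  # "|" is used as the s-command delimiter so "</sub>" needs no escaping.
  # The g flag repeats the substitution for every match on the line.
  sed 's|\([A-Za-z]\)\([0-9][0-9]*\)|\1<sub>\2</sub>|g'
}

printf '%s\n' 'H2SO4, C5H11OH.' | subscript_formulas
# prints: H<sub>2</sub>SO<sub>4</sub>, C<sub>5</sub>H<sub>11</sub>OH.
```

Because the digits must follow a letter, standalone values in the narrative (such as "25 mL") and leading coefficients (the first 2 in "2H2O") are left alone. Numbers after a closing parenthesis, as in Ca(OH)2, are not matched by this pattern; adding `)` to the bracket expression, i.e. `[A-Za-z)]`, is one way to cover that case. To process a file, the same sed command can be run with the file name as an argument and its output redirected to a new file.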
