Sed/awk subscript numbers in chemical formulas using markup tags

awk, html, scripting, text processing

I have hundreds of text files that contain chemical formulas mixed with narrative text, including numerical values. The formulas are always preceded by whitespace but can be followed by whitespace, commas, periods, etc.

The problem is that the formulas are not formatted to display numbers as subscripts, e.g.:

H2SO4, C5H11OH.

I want to format the subscripts as HTML tags, e.g.:

H<sub>2</sub>SO<sub>4</sub>, C<sub>5</sub>H<sub>11</sub>OH

So that subscripts render in HTML, e.g.:

H₂SO₄, C₅H₁₁OH

I have toyed with doing this in Java, PHP, etc., but those implementations turn out messy and awkward. I suspect that there is an elegant sed/awk approach.

Clearly, part of the solution is a regular expression that matches a letter followed by one or more digits, as a formula detection mechanism (there may be false positives, which I will correct manually later). Then, for each formula identified this way, a sed replacement needs to wrap every run of digits in an opening <sub> tag and a closing </sub> tag.

There must be a one-liner that does this, but I'm in over my head.

Any ideas?

Best Answer

E.g.:

sed -r 's:([A-Za-z])([0-9]+):\1<sub>\2</sub>:g'

should do the job.

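(Match a letter followed by a group of digits, capturing the letter as \1 and the digits as \2. Replace the match with the same letter (\1) followed by the digit group (\2) wrapped in <sub> and </sub>. The -r option turns on extended regular expressions in GNU sed, and the colons serve only as the delimiter of the s command, so the slash in </sub> needs no escaping.)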
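To try the substitution before touching any files, here is a minimal sketch that wraps the one-liner in a shell function and feeds it the sample line from the question. The function name subscript_formulas is made up for illustration, and -E is used in place of -r on the assumption that the sed in use accepts it (both GNU and BSD sed do).

```shell
# Hypothetical helper name, not part of sed: wraps the substitution for reuse.
# With no arguments it filters stdin; with file names it reads those files.
subscript_formulas() {
  # -E: extended regex syntax (GNU sed treats -r as a synonym)
  sed -E 's:([A-Za-z])([0-9]+):\1<sub>\2</sub>:g' "$@"
}

# Sample input taken from the question
printf '%s\n' 'H2SO4, C5H11OH.' | subscript_formulas
# prints: H<sub>2</sub>SO<sub>4</sub>, C<sub>5</sub>H<sub>11</sub>OH.
```

Numbers that do not directly follow a letter (e.g. "25 grams") pass through unchanged, while narrative tokens such as "A4" would still be tagged; those are the false positives the question expects to fix by hand.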
