I have a few Word documents, each containing a few hundred pages of scientific data which includes:
- Chemical formulae (H2SO4 with all proper subscripts & superscripts)
- Scientific numbers (exponents formatted using superscripts)
- Lots of mathematical equations, written using the equation editor in Word.
The problem is that storing this data in Word is not efficient for us, so we want to store all this information in a database (MySQL). We want to convert the formatting to LaTeX.
Is there any way to iterate through all the subscripts, superscripts and equations within a Word document using VBA?
Best Answer
Yes, there is. I would suggest using PowerShell, as it handles Word files quite well. I think it will be the easiest way.
More on automating Word from PowerShell here: http://www.simple-talk.com/dotnet/.net-tools/com-automation-of-office-applications-via-powershell/
I dug a little deeper and found this PowerShell script:
Save it as a .ps1 file and start it with:
It will save all the .doc files from the specified directory as HTML files. I have a .doc file containing your H2SO4 with subscripts, and after the PowerShell conversion the output is the following:
As you can see, subscripts have their own tags in HTML, so the only thing left is to parse the file in bash or C++: cut out everything from body to /body, change the sub tags to LaTeX, and remove the rest of the HTML tags afterwards.
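The body extraction and tag removal steps described above can be sketched in C++ roughly as follows. This is only a minimal illustration under stated assumptions, not the parser mentioned in this answer: the function names `extract_body` and `strip_tags` are hypothetical, tag names are assumed to be lowercase, and attribute values are assumed not to contain `<` or `>`.

```cpp
#include <string>

// Hypothetical helper: return the text between <body ...> and </body>.
// Falls back to the whole input when no body element is found.
std::string extract_body(const std::string& html) {
    const std::size_t open = html.find("<body");
    if (open == std::string::npos) return html;
    const std::size_t start = html.find('>', open);   // end of the opening tag
    if (start == std::string::npos) return html;
    const std::size_t end = html.find("</body>", start);
    if (end == std::string::npos) return html.substr(start + 1);
    return html.substr(start + 1, end - start - 1);
}

// Hypothetical helper: drop everything between '<' and '>' (inclusive),
// keeping only the text outside of tags.
std::string strip_tags(const std::string& text) {
    std::string out;
    bool in_tag = false;
    for (char c : text) {
        if (in_tag) {
            if (c == '>') in_tag = false;   // tag ends here
        } else if (c == '<') {
            in_tag = true;                  // tag starts here
        } else {
            out += c;                       // ordinary text
        }
    }
    return out;
}
```

Note that `strip_tags` should only run after the subscript and superscript tags have been converted, otherwise they are removed along with everything else.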
So I've developed a parser in C++ that looks for HTML subscripts and replaces them with LaTeX subscripts.
The code:
For the HTML file:
The output is:
It's not ideal, of course, but treat it as a proof of concept.
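For anyone who wants a starting point, here is a minimal C++ sketch of the tag replacement idea. It is not the original parser. The names `replace_all` and `html_scripts_to_latex` are hypothetical, and the sketch assumes the exported HTML uses plain lowercase `<sub>` and `<sup>` tags without attributes, which may not hold for every Word export. It also does nothing for equations written in the equation editor.

```cpp
#include <string>

// Replace every occurrence of `from` in `text` with `to`.
std::string replace_all(std::string text, const std::string& from,
                        const std::string& to) {
    if (from.empty()) return text;          // nothing sensible to search for
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();                   // continue after the inserted text
    }
    return text;
}

// Turn HTML subscripts/superscripts into inline LaTeX math:
//   <sub>x</sub>  ->  $_{x}$
//   <sup>x</sup>  ->  $^{x}$
// Assumption: tags are lowercase and carry no attributes.
std::string html_scripts_to_latex(std::string html) {
    html = replace_all(html, "<sub>", "$_{");
    html = replace_all(html, "</sub>", "}$");
    html = replace_all(html, "<sup>", "$^{");
    html = replace_all(html, "</sup>", "}$");
    return html;
}
```

With this sketch, the input `H<sub>2</sub>SO<sub>4</sub>` becomes `H$_{2}$SO$_{4}$`, and `10<sup>-3</sup>` becomes `10$^{-3}$`.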