Word – How to programmatically iterate through subscripts,superscripts and equations found in a Word document

microsoft wordmicrosoft-word-2007vba

I have a few Word documents, each containing a few hundred pages of scientific data which includes:

  • Chemical formulae (H2SO4 with all proper subscripts & superscripts)
  • Scientific numbers (exponents formatted using superscripts)
  • Lots of Mathematical Equations. Written using mathematical equation editor in Word.

Problem is, storing this data in Word is not efficient for us. So we want to store all this information in a database (MySQL). We want to convert the formatting to LaTex.

Is there any way to iterate through all the subcripts, superscripts and equations within a Word document using VBA?

Best Answer

Yes there is. I would sugest using Powershell as it handles Word files quite well. I think i will be the easiest way.

More on Powershell vs Word automation in here: http://www.simple-talk.com/dotnet/.net-tools/com-automation-of-office-applications-via-powershell/

I have digged a little deeper and i found this powershell script:

param([string]$docpath,[string]$htmlpath = $docpath)

$srcfiles = Get-ChildItem $docPath -filter "*.doc"
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatFilteredHTML");
$word = new-object -comobject word.application
$word.Visible = $False

function saveas-filteredhtml
    {
        $opendoc = $word.documents.open($doc.FullName);
        $opendoc.saveas([ref]"$htmlpath\$doc.fullname.html", [ref]$saveFormat);
        $opendoc.close();
    }

ForEach ($doc in $srcfiles)
    {
        Write-Host "Processing :" $doc.FullName
        saveas-filteredhtml
        $doc = $null
    }

$word.quit();

Save it as .ps1 and start it with:

convertdoc-tohtml.ps1 -docpath "C:\Documents" -htmlpath "C:\Output"

It will save all the .doc file from specified directory, as the html files. So i have a doc file in which i have your H2SO4 with subscripts and after powershell convertion the output is following:

<html>

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 14 (filtered)">
<style>
<!--
 /* Font Definitions */
 @font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
 /* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin-top:0in;
    margin-right:0in;
    margin-bottom:10.0pt;
    margin-left:0in;
    line-height:115%;
    font-size:11.0pt;
    font-family:"Calibri","sans-serif";}
.MsoChpDefault
    {font-family:"Calibri","sans-serif";}
.MsoPapDefault
    {margin-bottom:10.0pt;
    line-height:115%;}
@page WordSection1
    {size:8.5in 11.0in;
    margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
    {page:WordSection1;}
-->
</style>

</head>

<body lang=EN-US>

<div class=WordSection1>

<p class=MsoNormal><span lang=PL>H<sub>2</sub>SO<sub>4</sub></span></p>

</div>

</body>

</html>

As you can see subscripts have their own tags in HTML so only thing that is left is to parse the file in bash or c++ to cut from body to /body , change the to LATEX and remove the rest of HTML tags afterwards.

Code from http://blogs.technet.com/b/bshukla/archive/2011/09/27/3347395.aspx


So i've developed a parser in C++ to look for HTML subscript and replace it with LATEX subscript.

The code:

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <vector>

using namespace std;

 vector < vector <string> > parse( vector < vector <string> > vec, string filename )
{
        /*
                PARSES SPECIFIED FILE. EACH WORD SEPARATED AND
                PLACED IN VECTOR FIELD.

                REQUIRED INCLUDES:
                                #include <iostream>
                                #include <fstream>
                                #include <string>
                                #include <sstream>
                                #include <vector>

            EXPECTS: TWO DIMENTIONAL VECTOR
                     STRING WITH FILENAME
            RETURNS: TWO DIMENTIONAL VECTOR
                     vec[lines][words]
        */
        string vword;
        ifstream vfile;
        string tmp;

         // FILENAME CONVERSION FROM STING
        //  TO CHAR TABLE

        char cfilename[filename.length()+1];
        if( filename.length() < 126 )
        {
                for(int i = 0; i < filename.length(); i++)
                                cfilename[i] = filename[i];
                cfilename[filename.length()] = '\0';
        }
        else return vec;

         // OPENING FILE
        //
        vfile.open( cfilename );
        if (vfile.is_open())
        {
                while ( vfile.good() )
                {
                        getline( vfile, vword );
                        vector < string > vline;
                        vline.clear();

                        for (int i = 0; i < vword.length(); i++)
                        {
                                tmp = "";
                                 // PARSING CONTENT. OMITTING SPACES AND TABS
                                //
                                while (vword[i] != ' ' && vword[i] != ((char)9) && i < vword.length() )
                                        tmp += vword[i++];
                                if( tmp.length() > 0 ) vline.push_back(tmp);
                        }
                        if (!vline.empty())
                                vec.push_back(vline);
                }
                vfile.close();
        }
        else cout << "Unable to open file " << filename << ".\n";
        return vec;
}

int main()
{
        vector < vector < string > > vec;
        vec = parse( vec, "parse.html" );

        bool body = false;
        for (int i = 0; i < vec.size(); i++)
        {
                for (int j = 0; j < vec[i].size(); j++)
                {
                        if ( vec[i][j] == "<body") body=true;
                        if ( vec[i][j] == "</body>" ) body=false;
                        if ( body == true )
                        {
                                for ( int k=0; k < vec[i][j].size(); k++ )
                                {
                                        if (k+4 < vec[i][j].size() )
                                        {
                                                if (    vec[i][j][k]   == '<' &&
                                                        vec[i][j][k+1] == 's' &&
                                                        vec[i][j][k+2] == 'u' &&
                                                        vec[i][j][k+3] == 'b' &&
                                                        vec[i][j][k+4] == '>' )
                                                {

                                                        string tmp = "";
                                                        while (vec[i][j][k+5] != '<')
                                                        {
                                                                tmp+=vec[i][j][k+5];
                                                                k++;
                                                        }
                                                        tmp = "_{" + tmp + "}";
                                                        k=k+5+5;
                                                        cout << tmp << endl;;
                                                }
                                                else cout << vec[i][j][k];
                                        }
                                        else cout << vec[i][j][k];
                                }
                                cout << endl;
                        }
                }
        }
        return 0;
}

For the html file:

<html>

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 14 (filtered)">
<style>
<!--
 /* Font Definitions */
 @font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
 /* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin-top:0in;
        margin-right:0in;
        margin-bottom:10.0pt;
        margin-left:0in;
        line-height:115%;
        font-size:11.0pt;
        font-family:"Calibri","sans-serif";}
.MsoChpDefault
        {font-family:"Calibri","sans-serif";}
.MsoPapDefault
        {margin-bottom:10.0pt;
        line-height:115%;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
-->
</style>

</head>

<body lang=EN-US>

<div class=WordSection1>

<p class=MsoNormal><span lang=PL>H<sub>2</sub>SO<sub>4</sub></span></p>

</div>

</body>

</html>

The output is:

<body
lang=EN-US>
<div
class=WordSection1>
<p
class=MsoNormal><span
lang=PL>H_{2}
SO_{4}
</span></p>
</div>

It's not ideal of course, but treat is as proof of concept.

Related Question