Extract Named Groups of Regular Expression in LibreOffice Calc

libreoffice-calcregex

In Python 3, it is quite easy to extract text from named groups as shown in the following example:

import re
myStr = r"4000/2000/5000/7000"
reObj = re.compile(r"""(?P<g1>\d+)  # a capturing group named g1
                       /
                       (?P<g2>\d+)
                       /
                       (?P<g3>\d+)
                       /
                       (?P<g4>\d+)""", re.VERBOSE) 
matchObj = reObj.match(myStr)  # match the string to be searched
print(matchObj.group("g1"))  # 4000
print(matchObj.group("g2"))  # 2000 
print(matchObj.group("g3"))  # 5000 
print(matchObj.group("g4"))  # 7000

However, in LibreOffice Calc I simply couldn't get any clue (Calc does not even have a independent regex() function which gives a regex pattern). Position-based workarounds as in this post isn't what I need.

PLEASE GIVE ANSWERS INDEPENDENT OF POSITIONAL PARAMETERS and PLEASE EXEMPLIFY EXPLICITLY. e.g. MID() is not acceptable. Although the example given here is simple enough, but I need a general way to deal with a real situation that is far more complex.

Best Answer

In both Excel and Calc, the cleanest solution is to create a general-purpose regular expression macro. To do this in Calc, go to Tools -> Macros -> Organize Macros -> LibreOffice Basic and add the following code to Module1:

Function ReFind(findIn, patt, Optional group_param As Integer,  _
                Optional ignoreCase_param As Boolean)
    ' findIn - string or cell to search in
    ' patt - regexp string or cell containing regexp string
    ' group - which group to grab - analogy to \n in regexp syntax
    ' ignoreCase - false for case sensitive matches
    If IsMissing (group_param) Then
        group = 0
    Else
        group = group_param
    End If
    If IsMissing (ignoreCase_param) Then
        ignoreCase = False
    Else
        ignoreCase = ignoreCase_param
    End If
    oTextSearch = CreateUnoService("com.sun.star.util.TextSearch")
    oOptions = CreateUnoStruct("com.sun.star.util.SearchOptions")
    oOptions.algorithmType = com.sun.star.util.SearchAlgorithms.REGEXP
    If ignoreCase Then
        oOptions.transliterateFlags = _
            com.sun.star.i18n.TransliterationModules.IGNORE_CASE
    End If
    oOptions.searchString = patt
    oTextSearch.setOptions(oOptions)
    oFound = oTextSearch.searchForward(findIn, 0, Len(findIn))
    If oFound.subRegExpressions = 0 Then
        ReFind = "No results"
        MsgBox "No results"
        Exit Function
    ElseIf group >= oFound.subRegExpressions Then 
         ReFind = "No result for that group"
         MsgBox "No result for that group"
         Exit Function
    Else
         nStart = oFound.startOffset()
         nEnd = oFound.endOffset()
         ReFind = Mid(findIn, nStart(group) + 1, nEnd(group) - nStart(group))
    End If
End Function

Now you can use ReFind for any regular expressions needed in the spreadsheet. For example, in cell A1 enter 12345. In cell B1, enter the formula =REFIND($A$1,"(\d\d)(\d)",2). This will retrieve the third number, which is 3.

The code was adapted from https://forum.openoffice.org/en/forum/viewtopic.php?t=30502.

Note: It would be better to create the regular expression in python or java using an add-in. However that requires XML declaration files in an extension, which takes more time to set up.

Related Question