Excel – How to identify duplicates within text in two columns in Excel

microsoft excelworksheet-function

I have a worksheet where column A has various names in varying formats:
A1   John Smith
A2   Jones, Mary
A3   Sally Gomez
A4   The Gonzalez family

Column B has similar data:
B1   The Smith Family Trust
B2   Bob and Mary Jones
B3   Blackwell, John
B4   Luz Gonzalez

I would like to identify the instances where the same last name is found in column A and column B. In the examples above, the formula, if placed in column C, would result in

C1   TRUE   (because "Smith" is found in both A1 and B1)
C2   TRUE   (because "Jones" is found in both A2 and B2)
C3   FALSE   (because there are no common words between A3 and B3)
C4   TRUE   (because "Gonzalez" is found in both A4 and B4)

Is this even possible?

Best Answer

Given your comments as well as your question, it seems you want to return TRUE if any word in one phrase matches a word in the adjacent phrase. One way to do this is with a User Defined Function (VBA). The following excludes any words that are in arrExclude, which you can add to as you see fit. It will also exclude any characters that are not letters, digits or spaces, and any words that consist of just a single character.

See if this works for you.

Another option would be take a look at the free fuzzy lookup add-in provided by MS for excel versions 2007 and later.

To enter this User Defined Function (UDF), alt-F11 opens the Visual Basic Editor. Ensure your project is highlighted in the Project Explorer window. Then, from the top menu, select Insert/Module and paste the code below into the window that opens.

To use this User Defined Function (UDF), enter a formula like

=WordMatch(A1,B1)

in some cell.

EDIT2: Find Matches segment changed to see if it works better on Mac


Option Explicit
Option Base 0
Option Compare Text
Function WordMatch(S1 As String, S2 As String) As Boolean
    Dim arrExclude() As Variant
    Dim V1 As Variant, V2 As Variant
    Dim I As Long, J As Long, S As String
    Dim RE As Object
    Dim sF As String, sS As String

'Will also exclude single letter words
arrExclude = Array("The", "And", "Trust", "Family", "II", "III", "Jr", "Sr", "Mr", "Mrs", "Ms")

'Remove all except letters, digits, and spaces
'remove extra spaces
'Consider whether to retain hyphens

Set RE = CreateObject("vbscript.regexp")
With RE
    .Pattern = "[^A-Z0-9 ]+|\b\S\b|\b(?:" & Join(arrExclude, "|") & ")\b"
    .Global = True
    .ignorecase = True
End With

With WorksheetFunction
    V1 = Split(.Trim(RE.Replace(S1, "")))
    V2 = Split(.Trim(RE.Replace(S2, "")))
End With

'Find Matches
If UBound(V1) <= UBound(V2) Then
    sS = " " & Join(V2) & " "
    For I = 0 To UBound(V1)
        sF = " " & V1(I) & " "
        If InStr(sS, sF) > 0 Then
            WordMatch = True
            Exit Function
        End If
    Next I
Else
    sS = " " & Join(V1) & " "
    For I = 0 To UBound(V2)
        sF = " " & V2(I) & " "
        If InStr(sS, sF) > 0 Then
            WordMatch = True
            Exit Function
        End If
    Next I
End If

WordMatch = False

End Function

EDIT: Here is a screenshot of the results, using both your original examples, and also the examples you gave in a comment below where you indicated you were having a problem.

enter image description here

Related Question