Regular expression to find abbreviations

Tags: find-and-replace, libreoffice, libreoffice-writer, regex

(This is a spinoff from a question I asked earlier.)

I am attempting to devise a RegEx for LibreOffice Writer that finds all abbreviations in my PhD thesis. Currently I have the following:

\b(?:[A-Z]){2,}

This almost does the job, as it identifies all words beginning with more than one capital letter. However, I have some abbreviations it does not catch, namely these:

CoE RoR RoC
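
The gap can be reproduced outside Writer. The following is a minimal sketch in Python, assuming that Python's `re` module treats `\b`, `(?:...)` and `{2,}` the same way as Writer's regex engine for plain ASCII text. The sample sentence is made up for illustration; only CoE, RoR and RoC come from the question.

```python
import re

# The pattern from the question: two or more consecutive capitals at a word start.
ORIGINAL = re.compile(r"\b(?:[A-Z]){2,}")

# Hypothetical sample text; only CoE, RoR and RoC are taken from the question.
sample = "The EU and the CoE compared RoR with RoC in the UK."

# findall returns the full match because the group is non-capturing.
matches = ORIGINAL.findall(sample)
print(matches)  # ['EU', 'UK']
```

The mixed-case abbreviations are skipped because the second character is lowercase, and the later capital is not at a word boundary.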

Ideally I would like a RegEx that identifies all words with at least two capital letters, although they don't have to be at the beginning of the word. But I'm at a loss trying to create it. Can anyone point me in the right direction?

Best Answer

I interpreted the question a little differently from Jim K. Assuming that all abbreviations start with a capital letter and contain at least one more capital letter anywhere in the word, you don't have to add much to your existing regular expression:

\b(?:[A-Z][a-z]*){2,}

Each capital letter is grouped with any lowercase letters that follow it, so lowercase letters between capitals no longer break the match. Requiring at least two such groups guarantees that the word contains at least two capital letters.
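
As a quick check of that reasoning, here is a small Python sketch. It assumes Python's `re` behaves like Writer's engine for these constructs, and the sample sentence is hypothetical apart from CoE, RoR and RoC.

```python
import re

# Capital letter followed by optional lowercase letters, repeated at least twice.
STARTS_WITH_CAPITAL = re.compile(r"\b(?:[A-Z][a-z]*){2,}")

# Hypothetical sample text for illustration.
sample = "The CoE compared RoR with RoC, while NATO and PhD stayed unchanged."

found = STARTS_WITH_CAPITAL.findall(sample)
print(found)  # ['CoE', 'RoR', 'RoC', 'NATO', 'PhD']
```

Ordinary capitalized words such as "The" are not matched because they contain only one capital. Note that mixed-case names such as "McDonald" would match as well, since they also contain two capitals. Python matches case-sensitively by default; in Writer, the "Match case" option probably needs to be enabled for `[A-Z]` to mean capitals only.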

If abbreviations must contain at least two capital letters but do not have to start with one, also allow lowercase letters in front of each capital letter:

\b(?:[a-z]*[A-Z][a-z]*){2,}
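
The difference between the two patterns can be seen in a short Python sketch, again assuming Python's `re` matches Writer's behaviour here. The sample words "mRNA" and "eBook" are hypothetical additions for illustration.

```python
import re

# Pattern that requires the word to start with a capital letter.
STARTS_WITH_CAPITAL = re.compile(r"\b(?:[A-Z][a-z]*){2,}")
# Pattern that also allows lowercase letters before each capital.
LEADING_LOWERCASE_OK = re.compile(r"\b(?:[a-z]*[A-Z][a-z]*){2,}")

# Hypothetical sample text for illustration.
sample = "An mRNA sample, the CoE and an eBook."

strict = STARTS_WITH_CAPITAL.findall(sample)
relaxed = LEADING_LOWERCASE_OK.findall(sample)
print(strict)   # ['CoE']
print(relaxed)  # ['mRNA', 'CoE']
```

"eBook" is skipped by both patterns because it contains only one capital letter, while "mRNA" is found only by the second pattern because it starts with a lowercase letter.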

Both of these were tested against Jim's test text. (Thanks, Jim!)
