Search ﬁ (one character) as fi (two characters) in mdfind

pdf spotlight

In the PDF below, the two characters f and i are contracted into the one-character ligature ﬁ, as in "signiﬁcance". As a result, when I searched for "Brown Adipose Tissue: Function and Physiological Significance" (with f and i as two separate characters) using mdfind, I did not find this PDF. Is there a way to find such PDF files with mdfind by searching for "… Significance", where "f" and "i" are not ligated?

https://journals.physiology.org/doi/full/10.1152/physrev.00015.2003

$ mdfind -onlyin . "(kMDItemTextContent=='Brown Adipose Tissue: Function and Physiological Significance'c)" # two characters.
$ mdfind -onlyin . "(kMDItemTextContent=='Brown Adipose Tissue: Function and Physiological Signiﬁcance'c)" # one character
/Users/xxx/Downloads/x/physrev.00015.2003.pdf

Note that the following should not be used as it searches in kMDItemTitle as well.

$ mdfind -onlyin . 'Brown Adipose Tissue: Function and Physiological Significance' # two characters
/Users/xxx/Downloads/x/physrev.00015.2003.pdf

Best Answer

Although the fi characters are displayed as a single ligature glyph, they are understood within the PDF as distinct letters. (The same holds in every other text app, such as TextEdit, Pages, and Safari, which also display ligatures while treating them as separate characters.)

I can search in Safari or Preview within the PDF for the letters fi, and get the ligature in the results:

[Screenshot: Safari find]

I can also copy and paste the text, or export it from the PDF, and the text has separate characters for that ligature.

However, results using Spotlight do seem to be more variable. If I create a PDF from TextEdit with the word 'office' using ligature glyphs, that word is not found in a Spotlight search. If I do the same from Affinity Publisher, the word is found.

I have other PDFs with ligature glyphs that Spotlight can search.

It is of course also possible to produce a PDF where the underlying chars are not preserved.

TL;DR: it seems that Spotlight is choosy about font encoding when indexing PDF text content. Text encoded with a Type 1 Roman encoding does not produce the correct result.

So your options are to write a shell script that substitutes the ligature Unicode characters wherever the relevant combinations of characters occur (fi, fl, ffi, ffl, ct, st) and searches PDFs using both forms, or to use a non-Spotlight method of querying the text in the PDF.
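A minimal sketch of the first option follows. The function names ligate and search_both are hypothetical, and the script assumes the search phrase contains no single quotes. It only covers the five f-ligatures that have their own Unicode code points (U+FB00 to U+FB04). An st ligature exists at U+FB06 and could be added the same way, but as far as I know there is no precomposed code point for ct, so that pair cannot be handled by substitution. It also tries just two spellings (no ligatures, or all of them), so a PDF that mixes the two may still be missed.

```shell
#!/bin/sh
# Hypothetical helper: rewrite f-letter sequences as their single-code-point
# Unicode ligatures. Longest sequences go first so that "ffi" is not
# consumed as "ff" followed by "i".
ligate() {
  printf '%s\n' "$1" | sed \
    -e 's/ffi/ﬃ/g' \
    -e 's/ffl/ﬄ/g' \
    -e 's/ff/ﬀ/g' \
    -e 's/fi/ﬁ/g' \
    -e 's/fl/ﬂ/g'
}

# Hypothetical wrapper: ask Spotlight for files whose text content matches
# either the plain phrase or its ligated spelling.
# $1 = phrase (must not contain single quotes), $2 = directory (default ".")
search_both() {
  plain=$1
  dir=${2:-.}
  lig=$(ligate "$plain")
  mdfind -onlyin "$dir" \
    "(kMDItemTextContent=='$plain'c) || (kMDItemTextContent=='$lig'c)"
}
```

It could then be used like this:

```shell
search_both 'Brown Adipose Tissue: Function and Physiological Significance' .
```

If the phrase contains no ligature candidates, both halves of the query are identical, which is harmless.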