Search ﬁ (one character) as fi (two characters) in mdfind

pdf spotlight

In the PDF below, the two characters f and i are contracted into the one-character ligature ﬁ, as in "signiﬁcance". As a result, when I searched for "Brown Adipose Tissue: Function and Physiological Significance" (with f and i as two characters) using mdfind, I did not find this PDF. Is there a way to search such PDF files with mdfind using "… Significance", where "f" and "i" are not ligated?

https://journals.physiology.org/doi/full/10.1152/physrev.00015.2003

$ mdfind -onlyin . "(kMDItemTextContent=='Brown Adipose Tissue: Function and Physiological Significance'c)" # two characters.
$ mdfind -onlyin . "(kMDItemTextContent=='Brown Adipose Tissue: Function and Physiological Signiﬁcance'c)" # one character
/Users/xxx/Downloads/x/physrev.00015.2003.pdf

Note that the following should not be used as it searches in kMDItemTitle as well.

$ mdfind -onlyin . 'Brown Adipose Tissue: Function and Physiological Significance' # two characters
/Users/xxx/Downloads/x/physrev.00015.2003.pdf

Best Answer

Although the fi characters are displayed as a single ligature glyph, they are understood within the PDF as distinct letters. (The same holds in other text apps such as TextEdit, Pages and Safari, which also display ligatures while treating them as separate characters.)

I can search in Safari or Preview within the PDF for the letters fi, and get the ligature in the results.

I can also copy and paste the text, or export it from the PDF, and the text has separate characters for that ligature.

However, results using Spotlight do seem to be more variable. If I create a PDF from TextEdit with the word 'office' using ligature glyphs, that word is not found in a Spotlight search. If I do the same from Affinity Publisher, the word is found.

I have other PDFs with ligature glyphs that Spotlight can search.

It is of course also possible to produce a PDF where the underlying chars are not preserved.

TL;DR: it seems that Spotlight is choosy about font encoding when indexing PDF text content. Text encoded with a Type 1 Roman encoding does not produce the correct result.

So your options are to write a shell script that substitutes the ligated Unicode glyphs wherever the relevant combinations of characters occur (fi, fl, ffi, ffl, ct, st), and search PDFs using both forms; or to use a non-Spotlight method of querying the text in the PDF.
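As a rough sketch of the first option, the functions below rewrite a phrase with Unicode ligature characters and build a query that matches either spelling. The names `ligate` and `build_query` are made up for illustration. The sketch only covers fi, fl, ffi and ffl, and it assumes the indexed text uses a ligature at every place one could occur and that the phrase contains no single quote.

```shell
# Hypothetical helper: rewrite a phrase using Unicode ligature characters.
# Longer sequences go first so "ffi" is not consumed as "f" + "fi".
ligate() {
  printf '%s\n' "$1" | sed \
    -e 's/ffi/ﬃ/g' \
    -e 's/ffl/ﬄ/g' \
    -e 's/fi/ﬁ/g' \
    -e 's/fl/ﬂ/g'
}

# Hypothetical helper: build a kMDItemTextContent query that matches
# either the plain phrase or its ligated form (case-insensitive).
build_query() {
  printf "(kMDItemTextContent=='%s'c || kMDItemTextContent=='%s'c)\n" \
    "$1" "$(ligate "$1")"
}

# Usage on macOS (commented out because mdfind exists only there):
# mdfind -onlyin . "$(build_query 'Brown Adipose Tissue: Function and Physiological Significance')"
```

A PDF that ligates only some of the possible sequences would still be missed; covering that case would mean generating every combination of ligated and plain spellings.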

Related Solutions

Can “mdfind” search for phrases and not just unordered words

You need to escape your quotes like so:

mdfind \"I love Apple\" -onlyin ~/Documents

This results in just the one document being found:

~/Documents/test1.txt

Without escaping them, I don't think the quotes actually get passed to the mdfind command; they're just interpreted by your shell to say that I love Apple is a single argument. With the backslash-escaping, the quote characters are passed through to mdfind as part of the query.
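To see what the shell actually hands to a command under each quoting style, a small stand-in can print its arguments. `show_args` is a made-up helper, not part of mdfind. Note that the backslash form yields three separate words with the quotes attached to the first and last, while wrapping the double quotes in single quotes yields one argument; how mdfind treats several words is not shown here.

```shell
# Hypothetical helper: print each argument on its own line, in brackets.
show_args() {
  for a in "$@"; do
    printf '[%s]\n' "$a"
  done
}

show_args "I love Apple"     # one argument, quotes removed: [I love Apple]
show_args \"I love Apple\"   # three arguments: ["I] [love] [Apple"]
show_args '"I love Apple"'   # one argument, quotes kept: ["I love Apple"]
```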

Spotlight – How to Remove All mdfind Results with Spaces in Pathnames

You can use mdfind -0 to print a null character after each path. Then use xargs -0 to split the list on each null character instead of on the default whitespace.
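A minimal illustration of why the null delimiter matters, with printf standing in for mdfind so it runs anywhere; the two sample paths are invented:

```shell
# Newline-delimited input: xargs splits on whitespace, so the path
# containing a space is broken into two items.
printf '%s\n' '/tmp/a b.pdf' '/tmp/c.pdf' | xargs -n 1 printf '[%s]\n'

# NUL-delimited input, as mdfind -0 would emit: each path stays whole.
printf '%s\0' '/tmp/a b.pdf' '/tmp/c.pdf' | xargs -0 -n 1 printf '[%s]\n'

# On macOS the same pattern would look like this:
#   mdfind -0 -onlyin . 'I love Apple' | xargs -0 ls -ld
```

The first pipeline prints three bracketed items and the second prints two.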
