Any ligature-aware alternative for “pdfgrep” on the command line

I regularly use "pdfgrep" to search multiple PDF files from the command line, but I have run into a problem with the ligature character "ﬁ" (see https://www.compart.com/en/unicode/U+FB01).
In my PDF the word "fixed" is stored with "ﬁ", so pdfgrep -iR 'fixed point operator' does not find the term "fixed point operator". However, when I open the file in a PDF reader such as Foxit Reader or Evince, "ﬁ" is treated as "f" followed by "i" and the term is searchable. Is there a more reliable alternative to "pdfgrep"? Or does "pdfgrep" have an option that expands such characters?

The PDF file is http://direct.mit.edu/books/chapter-pdf/238450/9780262321037_can.pdf .

I am on Ubuntu 20.04, amd64, kernel Linux 5.6.0-1018-oem. pdfgrep has an option --unac, but with the package installed via sudo apt-get install pdfgrep, using --unac reports "pdfgrep: UNAC support disabled at compile time!"

pdfgrep:
  Installed: 2.1.2-1build1
  Candidate: 2.1.2-1build1
  Version table:
 *** 2.1.2-1build1 500
        500 http://mirrors.huaweicloud.com/ubuntu focal/universe amd64 Packages
        100 /var/lib/dpkg/status

1.6. Inﬁnite and σ-ﬁnite measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4. The general deﬁnition of the Lebesgue integral . . . . . . . . . . . . . . 118
2.6. Integration with respect to inﬁnite measures . . . . . . . . . . . . . . . . 124
3.5. Inﬁnite products of measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

Best Answer

To solve this problem, first use pdftotext to find out how the ligature is actually encoded in the extracted UTF-8 text. For example, I ran:

pdftotext -f 11 -l 13 ~/Mathematics/Analysis/MeasureTheory.pdf text && cat text

and got a line of output that looks like this:

   1.6.  Inﬁnite and σ-ﬁnite measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

From this I can see that "ﬁ" is in fact a single ligature character. My terminal displays it as a telephone symbol (☎), whereas a browser renders it as "ﬁ".
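If the rendering leaves any doubt about which character was extracted, dumping its bytes settles it. The following is only a small sketch; lig_bytes is a hypothetical helper name, and it assumes the character is pasted into a UTF-8 terminal:

```shell
# Hypothetical helper: print the bytes of its argument as hex digits.
lig_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# U+FB01 (the "fi" ligature) is one character stored as three UTF-8 bytes,
# whereas the plain letters "f" and "i" are two single-byte characters.
lig_bytes 'ﬁ'; echo
lig_bytes 'fi'; echo
```

The two outputs differ, which is why a search for the plain letters does not match the ligature.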

So I pass that same character, copied from the pdftotext output, to pdfgrep:

pdfgrep --page-range=11-13 ﬁ ~/Mathematics/Analysis/MeasureTheory.pdf

This finally returns the desired results.
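Pasting the ligature by hand works for a single word, but a PDF may mix ligated and plain spellings. One possible way to cover both is to rewrite the query so that each group of letters that is commonly ligated also matches the corresponding character from U+FB00 to U+FB04. The sketch below is an assumption-laden illustration, not a pdfgrep feature: ligature_pattern is a hypothetical helper, and it expects a query without regular-expression metacharacters.

```shell
# Hypothetical helper, not part of pdfgrep: rewrite a plain query so that
# every "ffi", "ffl", "ff", "fi" and "fl" also matches its ligature form.
# Longest groups go first; the brackets in e.g. f[i] keep later rules from
# rewriting text that an earlier rule already produced.
ligature_pattern() {
  printf '%s\n' "$1" | sed \
    -e 's/ffi/(f[f][i]|ﬃ|ﬀi|fﬁ)/g' \
    -e 's/ffl/(f[f][l]|ﬄ|ﬀl|fﬂ)/g' \
    -e 's/ff/(f[f]|ﬀ)/g' \
    -e 's/fi/(f[i]|ﬁ)/g' \
    -e 's/fl/(f[l]|ﬂ)/g'
}

# Print the extended regular expression built for the question's search term.
ligature_pattern 'fixed point operator'
```

Assuming pdfgrep interprets the pattern as an extended regular expression (its documented default), the result could be passed along as pdfgrep -iR "$(ligature_pattern 'fixed point operator')" . to match both "fixed" and "ﬁxed".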
