Command Line Tool to Batch Convert .EML/.EMLX/.MBOX to Searchable PDFs

email, file conversion, ocr, pdf

I need to convert about 500k emails into searchable PDFs. By 'searchable' I mean that macOS can search their text for specific words rather than treating each PDF as an image. So far, my searches for a tool to do this have turned up only proprietary database apps and overpriced, sketchy x-to-PDF converters that basically replicate the built-in macOS Print to PDF function. Is there a single tool, or two complementary tools, that could be used in Terminal to batch convert all the emails to searchable PDFs?

Best Answer

I had to do this with ~180 emails, and I used a command-line tool I found on GitHub that converts .eml to .pdf via .html: https://github.com/nickrussler/eml-to-pdf-converter

It takes a little while to convert each .eml file - 22 minutes for 186 emails with lots of images - so it's probably not helpful for a 500k email task. (Maybe if you're reeeally not in a rush and not afraid of multiprocessing!) If it is helpful for you or anyone else, though, here's how I got it to work in the bash command line:

  1. git clone the repo

  2. Install the wkhtmltopdf binary (installing with pip is insufficient) from here: https://wkhtmltopdf.org/downloads.html

  3. From within the cloned repo, generate the email converter .jar file: ./gradlew shadowJar

  4. Run a for loop to convert every file in the .mbox (or in a directory of .eml files):

# Convert each file in the mailbox directory, one at a time
for file in /path/to/mailbox.mbox/*; do
   java -jar ./build/libs/emailconverter-2.0.1-all.jar "$file"
done
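
For scale: at the rate I saw (22 minutes for 186 emails, so roughly 7 seconds each), 500k emails would take on the order of 41 days run one at a time. If you do want to try the multiprocessing route, here is a rough sketch using find and xargs to run several conversions at once. The function name convert_parallel, the JAR variable, the default of 4 jobs, and the '*.eml' filter are all my own placeholders, not part of the converter, and I have not tested whether running many instances of the converter side by side is reliable, so try it on a small batch first.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of eml-to-pdf-converter):
# convert every .eml file under a directory, several at a time.
#   $1 = directory to scan
#   $2 = number of parallel jobs (assumed default: 4)
# JAR may be overridden in the environment; the default is the path
# used in the sequential loop above.
convert_parallel() {
  local src_dir="$1"
  local jobs="${2:-4}"
  local jar="${JAR:-./build/libs/emailconverter-2.0.1-all.jar}"

  # -print0 / -0 keep filenames containing spaces or newlines intact.
  # -n 1 passes one file per java invocation, as in the loop above.
  # -P runs up to $jobs invocations concurrently.
  find "$src_dir" -type f -name '*.eml' -print0 |
    xargs -0 -n 1 -P "$jobs" java -jar "$jar"
}
```

Usage would be something like: convert_parallel /path/to/emails 8. Each conversion still starts its own JVM, so the speedup depends on how many cores and how much memory you have; even with perfect scaling across 8 jobs, 500k emails would still be a multi-day run.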