Mac – How to search the contents of PDF email attachments that are backed up to Time Machine

email mail.app pdf search time-machine

So I have quite a specific situation where I need to dig out an old email I have backed up to Time Machine, but I can't remember when I received it. It could be anywhere between 6 months and 1 year ago.

Following the advice here, I ran the following in Terminal:

find "/Volumes/Time Machine" -name "*.emlx" -exec grep -l "search term" {} \;

Which found me several thousand results matching "search term" (the subject line).

The problem is that the bodies of those messages are all identical. What I need to search for a specific term is the contents of the attachment, a PDF document.

Any idea how I can do a search similar to the above, but searching the contents of the PDF attachment?

Best Answer

To search PDF contents embedded in an email using grep, you face at least one challenge, if not two.

The first is that files embedded in an email are not preserved in their raw form; they are encoded into plain text for transmission in the email message. The MIME encoding commonly used for attachments is Base64, but that is not always the case. You can find more details on base64, including encode/decode methods, here: http://en.wikipedia.org/wiki/Base64
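To check which encoding a particular message actually uses, you can list the Content-Transfer-Encoding headers it contains. This is only a minimal sketch: the function name show_encodings is made up for illustration, and it assumes the header sits on a single line at the start of a line, as it normally does in a saved message.

```shell
# Print each distinct Content-Transfer-Encoding header found in one message
# file, prefixed with the number of times it occurs.
# show_encodings is a hypothetical helper name, not part of any tool.
show_encodings() {
    grep -i '^Content-Transfer-Encoding:' "$1" | tr -d '\r' | sort | uniq -c
}
```

Called as show_encodings path/to/message.emlx, a line ending in "base64" in the output suggests that at least one part of that message, quite possibly the attachment, is Base64 encoded.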

Assuming the attachment was encoded in base64, it's still not as simple as taking your search string and turning it into a base64 representation, because the resulting encoding depends on where in the input stream your target string appears. base64 takes every 6 bits of the input stream and turns them into one plain text character. Assuming the PDF document uses a simple 8-bit encoding for the string, every 3 characters in the PDF turn into 4 encoded characters. A quick illustration of how character position can change things:

% echo "123456789" | base64
MTIzNDU2Nzg5Cg==
% echo "0123456789" | base64
MDEyMzQ1Njc4OQo=

In the first example, 123, 456, and 789 turn into MTIz, NDU2 and Nzg5 respectively (the trailing Cg== comes from the newline that echo appends). In the second, the encoded tuples are 012 => MDEy, 345 => MzQ1, ... etc.

Your search string could start at the beginning, middle or end of one of these tuples. If your search string is long enough, you can carve out three substrings from it that would be searchable, one for each possible alignment; each must be a multiple of 3 characters long. For example, if I wanted to search for the string 123456789, I could base64 encode three possible search candidates: 123456789, 234567 and 345678. One of these base64 encodings should show up in the encoded PDF file if the plain text string 123456789 is present.
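The three candidates described above can be generated with a few lines of shell. This is a rough sketch rather than a finished tool: b64_candidates is a made-up name, and it assumes the term is plain ASCII and at least 5 characters long so that every alignment leaves something to encode.

```shell
# Print the base64 fragments for the three possible alignments of a search
# term: drop 0, 1 or 2 leading characters, trim the rest to a multiple of 3
# characters, and encode what is left.
# b64_candidates is a hypothetical helper name; ASCII input is assumed.
b64_candidates() (
    term=$1
    len=${#term}
    for skip in 0 1 2; do
        # Longest multiple of 3 that fits after skipping "skip" characters.
        keep=$(( (len - skip) / 3 * 3 ))
        [ "$keep" -gt 0 ] || continue
        chunk=$(printf '%s' "$term" | cut -c "$((skip + 1))-$((skip + keep))")
        # printf avoids the trailing newline that echo would add to the input.
        printf '%s' "$chunk" | base64 | tr -d '\n'
        echo
    done
)
```

For 123456789 this prints MTIzNDU2Nzg5, MjM0NTY3 and MzQ1Njc4, which could each be used as the grep pattern in place of "search term". One caveat: Base64 bodies in email are wrapped into short lines, so a fragment that happens to straddle a line break will still be missed by a line-by-line grep.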

Now we get to the second big problem. So far we've assumed that the unencoded contents of the PDF file are plain UTF-8 text. If you look at the contents of an actual PDF file, you'll find it has all sorts of internal encodings, including font indicators, text encoded in UTF-16 or UTF-32, and so on. It's quite likely that your simple search phrase won't appear in the PDF data exactly as you expect. There can be extra content in the PDF between characters and between words, and there are also line breaks, so a phrase that crosses a line break makes building your target search string even more complicated.

There are a couple of other things you could try, though, so all is not lost.

If you know the name of the PDF, you could search for that instead. The file name should appear as readable text in the MIME headers of the email message (at least for a plain ASCII name) and is therefore searchable.
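A file name search can reuse the find/grep approach from the question. The sketch below is only illustrative: find_by_attachment_name is a made-up name, and it assumes the file name is stored literally in the message rather than in an encoded form.

```shell
# List the .emlx files under a directory that mention a given attachment name.
# -F treats the name as a literal string (so the dot in ".pdf" is not a
# wildcard), -i ignores case, -l prints only the matching file names.
# find_by_attachment_name is a hypothetical helper name.
find_by_attachment_name() {
    find "$1" -name '*.emlx' -exec grep -l -i -F -e "$2" {} +
}
```

For example, find_by_attachment_name "/Volumes/Time Machine" "report.pdf", where report.pdf stands in for the real file name.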

You might also get lucky if, instead of searching the emails, you search for the PDF files themselves. OS X Mail extracts attachments and stores them in a mail download directory. I'm running Mavericks and Mail v7.3, and my attachments get written to subdirectories under ~/Library/Containers/com.apple.mail
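To see which extracted PDFs exist, you could simply list them. A small sketch, with list_pdfs as a made-up name; it only lists files and does not look inside them.

```shell
# List every PDF file under a directory, matching the extension in any case.
# list_pdfs is a hypothetical helper name.
list_pdfs() {
    find "$1" -type f -iname '*.pdf'
}
```

On the setup described above that would be list_pdfs ~/Library/Containers/com.apple.mail. The same function could be pointed at the matching folder inside the Time Machine backup, wherever that lives on your backup volume.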

By the way, your grep command is looking for an exact, case-sensitive match, so if you are unsure of the upper/lower case usage you should add the '-i' option for a case-insensitive search.