How to generate email statistics from mutt header cache

mutt

When configured accordingly (set header_cache=), mutt saves mail headers in a cache file, which could be used to generate mail statistics. Does anybody know anything about the file format? Are there any tools for extracting the information it contains (besides strings, grep, awk and the like)?

Best Answer

Short answer:

It's entirely possible that the cache will not be comprehensive. If you delete mail and hcache later recomputes the header cache for that mailbox, your stats will not include mail from before the deletion.

If you don't have access to the mail logs for your server, do you have access to a filter mechanism, e.g. procmail? You could use that to generate an alternative log for analysis.

Otherwise, can you poll your mailbox with a program that can generate a log of mail received? Something like an offlineimap filter, or fetchmail/retchmail combined with some hashing and caching.

Longer answer:

The cache file is a DBM-style database. Depending on the exact build options of your mutt, it could be QDBM, Tokyo Cabinet, GDBM or Berkeley DB (BDB), all of which are key/value stores with broadly similar, but not interchangeable, APIs and file formats.

I doubt you can reliably read the DB unless you use the matching library implementation. ldd tells me my local mutt uses the Tokyo Cabinet implementation:

$ ldd /usr/bin/mutt
…
libtokyocabinet.so.8 => /usr/lib/libtokyocabinet.so.8 (0xb74f2000)
…

You would then need to write a program, using that library, to query the database stored in the cache file. There are bindings for Perl, Ruby, Lua, Java, and of course C.
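Before picking a binding, it may help to check which format a given cache file actually is, independently of what ldd reports. Here is a minimal Python sketch, assuming the header cache is a single regular file. The standard library's dbm.whichdb recognises GDBM and a few other DBM formats. The check for a leading "ToKyO CaBiNeT" signature is my assumption about how Tokyo Cabinet hash databases begin, and the name sniff_cache_format is purely illustrative.

```python
import dbm
import sys


def sniff_cache_format(path):
    """Return a best-guess name for the database format of `path`."""
    try:
        with open(path, "rb") as fh:
            head = fh.read(16)
    except OSError:
        return "unreadable"

    # Assumption: Tokyo Cabinet hash databases start with this signature.
    if head.startswith(b"ToKyO CaBiNeT"):
        return "tokyocabinet"

    # dbm.whichdb returns a module name such as "dbm.gnu", an empty string
    # for an unrecognised format, or None if the file cannot be opened.
    guess = dbm.whichdb(path)
    return guess if guess else "unknown"


if __name__ == "__main__":
    # Usage: python sniff_cache.py /path/to/header_cache_file
    for cache_path in sys.argv[1:]:
        print(cache_path, sniff_cache_format(cache_path))
```

This only identifies the container format. Decoding the values stored inside would still require the matching library and knowledge of how mutt serialises headers.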

It would appear that headers are stored as values in the DB, indexed by a CRC. From what I can tell, the CRC is derived from the path to a mailbox, which implies that the stored headers are the headers for all mail in that mailbox. So your program is essentially going to end up with a buffer containing all headers for all mail in a given mailbox. I don't think it will be much more useful than pulling the headers from all mail currently in your mailbox (and given the "short answer" above, not guaranteed to be more reliable).
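For comparison, pulling the headers straight from the mail currently in a mailbox is short to sketch with Python's standard mailbox module. This assumes the mail is stored as a Maildir (mailbox.mbox could be swapped in for mbox files), and the function name count_senders is just illustrative. Like the cache, it only sees mail that is still present.

```python
import mailbox
import sys
from collections import Counter
from email.utils import parseaddr


def count_senders(maildir_path):
    """Count messages per sender address in a Maildir."""
    counts = Counter()
    # create=False: fail rather than silently create an empty Maildir.
    box = mailbox.Maildir(maildir_path, create=False)
    try:
        for message in box:
            # parseaddr splits "Name <addr>" into (name, addr).
            _, address = parseaddr(str(message.get("From", "")))
            counts[address.lower() or "(unknown)"] += 1
    finally:
        box.close()
    return counts


if __name__ == "__main__":
    # Usage: python sender_stats.py /path/to/Maildir
    for path in sys.argv[1:]:
        for address, total in count_senders(path).most_common(10):
            print(f"{total:6d}  {address}")
```

The same loop could tally other headers, such as Date or List-Id, by changing the header name passed to message.get.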
