Memoize (cache) for command-line programs

cache, command line, software-rec

I sometimes wind up running the same fairly expensive command over and over again, getting the same output each time; e.g., running ffprobe to get information about a media file. Given the same input, the same output should always be produced, so caching should be possible.

I've seen Memoizing/caching command line output, but I'm looking for a more thorough implementation: in particular, that one seems to compare only the command line, so if one of the files passed to the command is modified, it won't notice. (It also has a bunch of fixed-length buffers, which make me suspicious, and it is oddly implemented as a daemon.)

Before I go off and write my own, I'm curious if one already exists. The key requirements:

  • Must re-run the command if any of the input files (on the command line) change
  • Must re-run the command if any of the command line options change
  • I'm OK with (and honestly expect) commands being run non-interactively: e.g., with /dev/null as stdin, and two different files as stdout and stderr.
  • If the command errors out, I'm OK with that being cached along with the exit code, or alternatively with it not being cached at all.
  • Should return cached content as often as possible, given the above. But correctness comes first.
  • Preferably, the cache can be shared between several machines (all under common control), e.g., via NFS.

Basically, if I write my own, here's what I'm thinking of doing (skipping some locking and error checking for brevity): take the command line plus the stat results of each item on the command line (either the stat error, or the dev, inode, size, and mtime) and pass that whole mess through SHA-512 or SHA-256. That gives a fixed-size key that changes if the command or the files change (unless someone makes a size- and mtime-preserving change, in which case they deserve what they get). Check whether that key is in the cache directory. If it is, copy the cached stdout and stderr files to stdout and stderr. Otherwise, run the command in a subprocess with /dev/null as stdin and two files as stdout and stderr; if it succeeds, put those files into the cache directory, then copy their contents to stdout and stderr. If it turns out I wind up writing my own, design feedback is welcome. And the result will be free software.
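The scheme above can be sketched in Python. This is a minimal sketch under the stated assumptions: the cache location and function names are hypothetical, and the locking that the question mentions is still skipped, so concurrent runs of the same command may race.

```python
import hashlib
import os
import shutil
import subprocess
import sys
import tempfile

# Hypothetical cache location; the question suggests this could live on NFS.
CACHE_DIR = os.path.join(tempfile.gettempdir(), 'cmd-memo')

def cache_key(argv):
    """SHA-256 over the command line plus the stat results of each
    argument: either the stat error, or (dev, inode, size, mtime)."""
    h = hashlib.sha256()
    for arg in argv:
        h.update(arg.encode() + b'\0')
        try:
            st = os.stat(arg)
            # A size- and mtime-preserving change will (deliberately) slip by.
            meta = (st.st_dev, st.st_ino, st.st_size, st.st_mtime_ns)
        except OSError as e:
            meta = ('stat-error', e.errno)  # an option, or a nonexistent file
        h.update(repr(meta).encode() + b'\0')
    return h.hexdigest()

def run_cached(argv):
    key = cache_key(argv)
    out = os.path.join(CACHE_DIR, key + '.out')
    err = os.path.join(CACHE_DIR, key + '.err')
    if not (os.path.exists(out) and os.path.exists(err)):
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(out + '.tmp', 'wb') as o, open(err + '.tmp', 'wb') as e:
            rc = subprocess.run(argv, stdin=subprocess.DEVNULL,
                                stdout=o, stderr=e).returncode
        if rc != 0:  # taking the "don't cache failures" option
            os.unlink(out + '.tmp')
            os.unlink(err + '.tmp')
            sys.exit(rc)
        os.rename(out + '.tmp', out)  # publish only complete entries
        os.rename(err + '.tmp', err)
    with open(out, 'rb') as f:
        shutil.copyfileobj(f, sys.stdout.buffer)
    with open(err, 'rb') as f:
        shutil.copyfileobj(f, sys.stderr.buffer)
```

The write-to-.tmp-then-rename step matters for a shared cache: on POSIX filesystems, rename within a directory is atomic, so readers see either no cache entry or a complete one, never a half-written file.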

Best Answer

There are so many cases where what you want won't work that you won't find a generic tool that gives really good results:

  • Commands that access files that aren't on the command line. (locate myfile)
  • Commands that access the network. (wget http://news.example.com/headlines)
  • Commands that depend on the time. (date)
  • Commands that have random output. (pwgen)
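The time-dependence problem is easy to demonstrate: two invocations of date with an identical command line produce different output, so a cache keyed only on argv would silently serve stale results. A quick sketch (assumes GNU date, for the %N nanoseconds format):

```python
import subprocess
import time

def run(argv):
    return subprocess.run(argv, capture_output=True, text=True).stdout

# Same command line both times, but the output differs.
first = run(['date', '+%s.%N'])
time.sleep(0.01)
second = run(['date', '+%s.%N'])
```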

If you take on the task of deciding which commands to apply the tool to, then what you want is a build tool: a tool that runs commands when their output is not up to date. The venerable make won't be very pleasant here: you have to define the dependencies manually, which means carefully separating the caches for different commands and manually invalidating a cache when you change a command, and you need to store each cached result in a separate file, which is inconvenient. One of the many alternatives may be more up to the task; perhaps SCons, which supports both checksum- and timestamp-based dependency analysis, has a caching mechanism on top of that, and can be tweaked by writing Python code.
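For the ffprobe example from the question, an SCons setup might look like the following sketch. The file names and cache path are hypothetical; CacheDir, Decider, and Command are real SCons features:

```python
# SConstruct -- cache ffprobe output, rebuilding only when inputs change.
env = Environment()
env.Decider('MD5')  # rebuild on content change, not just timestamp
CacheDir('/mnt/shared/scons-cache')  # shared cache, e.g. on NFS

# Re-runs only if input.mkv (by checksum) or the command itself changes.
env.Command('input.json', 'input.mkv',
            'ffprobe -v quiet -print_format json -show_format $SOURCE > $TARGET')
```

SCons folds the command string into the build signature, so changing the ffprobe options also triggers a re-run, which covers the question's second requirement.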
