How to transparently cache any directory or mounted file system for reads and write back

cache cloud fuse

Say I mount some cloud storage (Amazon Cloud Drive in my case) with a FUSE client at /mnt/cloud. Reading and writing files directly under /mnt/cloud is slow because every operation has to go over the internet, so I want to cache the files that I read from and write to cloud storage. Since I might be writing a lot of data at a time, the cache should sit on my disk and not in RAM. But I don't want to replicate the entire cloud storage on my disk, because my disk may be too small.

So I want a cached view of /mnt/cloud mounted at /mnt/cloud_cache, which uses another path, say /var/cache/cloud, as the caching location.

If I now read /mnt/cloud_cache/file, I want the following to happen:

Check whether file is cached at /var/cache/cloud/file.

  1. If cached: Check that the cached copy is up to date by fetching the modification time and/or checksum from /mnt/cloud. If it is up to date, serve the file from the cache; otherwise go to 2.
  2. If not cached or cache is out-of-date: Copy /mnt/cloud/file to /var/cache/cloud/file and serve it from the cache.
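
To make the read path concrete, here is a minimal Python sketch of the two steps above. It is only an illustration of the logic I have in mind, not an existing tool: the function name cached_read_path and its parameters are made up, and it compares size and modification time (whole seconds) instead of a checksum. A real solution would do this inside a FUSE layer rather than as a helper function.

```python
import os
import shutil

def cached_read_path(rel_path, source_root, cache_root):
    """Return a path inside cache_root holding an up-to-date copy of rel_path.

    Hypothetical helper: source_root plays the role of /mnt/cloud and
    cache_root the role of /var/cache/cloud.
    """
    src = os.path.join(source_root, rel_path)
    dst = os.path.join(cache_root, rel_path)

    # One metadata lookup on the slow backend (cheap compared to a download).
    src_stat = os.stat(src)

    fresh = False
    if os.path.exists(dst):
        dst_stat = os.stat(dst)
        # Step 1: the cached copy counts as fresh if size and mtime match.
        fresh = (dst_stat.st_size == src_stat.st_size
                 and int(dst_stat.st_mtime) == int(src_stat.st_mtime))

    if not fresh:
        # Step 2: (re)fetch the file. copy2 also copies the mtime, which is
        # what lets the comparison above succeed on the next read.
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy2(src, dst)

    return dst
```

A caller would then open the returned path instead of the file under /mnt/cloud, e.g. open(cached_read_path("file", "/mnt/cloud", "/var/cache/cloud"), "rb").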

When I write to /mnt/cloud_cache/file, I want this to happen:

  1. Write to /var/cache/cloud/file and record in a journal that file needs to be written back to /mnt/cloud
  2. Wait until writing to /var/cache/cloud/file has finished and/or previous write-backs to /mnt/cloud have completed
  3. Copy /var/cache/cloud/file to /mnt/cloud
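
The write path could look roughly like the following Python sketch. Again, this is just an illustration under my own assumptions: cached_write, write_back, and the one-JSON-object-per-line journal format are all hypothetical, and the write-back runs synchronously when called, whereas a real implementation would run it in the background.

```python
import json
import os
import shutil

def cached_write(rel_path, data, cache_root, journal_path):
    """Step 1: write to the cache, then record the pending write-back."""
    dst = os.path.join(cache_root, rel_path)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    with open(dst, "wb") as f:
        f.write(data)
    # Append-only journal; fsync so the record survives a crash.
    with open(journal_path, "a") as journal:
        journal.write(json.dumps({"path": rel_path}) + "\n")
        journal.flush()
        os.fsync(journal.fileno())

def write_back(source_root, cache_root, journal_path):
    """Steps 2 and 3: copy every journaled file to the backing store."""
    if not os.path.exists(journal_path):
        return []
    with open(journal_path) as journal:
        pending = [json.loads(line)["path"] for line in journal if line.strip()]
    written = []
    # dict.fromkeys drops duplicate entries while keeping journal order.
    for rel_path in dict.fromkeys(pending):
        target = os.path.join(source_root, rel_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copy2(os.path.join(cache_root, rel_path), target)
        written.append(rel_path)
    # Drop the journal only after every copy succeeded; if a copy raises,
    # the journal stays in place and the whole batch is retried next time.
    os.remove(journal_path)
    return written
```

Because the journal is removed only at the end, an interrupted upload is simply repeated on the next run, which costs bandwidth but does not lose data.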

I have the following requirements and constraints:

  • Free and open source
  • Ability to set an arbitrary cache location
  • Ability to cache an arbitrary location (probably some FUSE mount point)
  • Transparent caching, i.e. the caching mechanism is transparent to whatever uses /mnt/cloud_cache, which works like any other mounted file system
  • Keeping a record of what needs to be written back (the cache might get a lot of data that needs to be written back to the original storage location over the course of days)
  • Automatic deletion of cached files that have been written back or have not been accessed in a while
  • Consistency (i.e. reflecting external changes to /mnt/cloud) isn't terribly important, as I will probably have only one client accessing /mnt/cloud at a time, but it would be nice to have.
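
For the automatic deletion requirement, something along these lines is what I have in mind. This is a hypothetical sketch: evict_stale and its parameters are my own names, pending is assumed to be the set of relative paths still waiting for write-back, and it relies on access times, which are only approximate on file systems mounted with relatime or noatime.

```python
import os
import time

def evict_stale(cache_root, pending, max_age_seconds, now=None):
    """Delete cached files that are not pending write-back and have not
    been accessed for more than max_age_seconds. Returns removed paths."""
    now = time.time() if now is None else now
    removed = []
    for dirpath, _dirnames, filenames in os.walk(cache_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, cache_root)
            if rel in pending:
                # Never drop data that has not reached the backing store.
                continue
            if now - os.stat(full).st_atime > max_age_seconds:
                os.remove(full)
                removed.append(rel)
    return sorted(removed)
```

A variant could evict by total cache size instead of age, removing the least recently accessed files first until the cache fits a quota.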

I've spent quite some time looking for existing solutions, but haven't found anything satisfactory.

Best Answer

Try using catfs, a generic FUSE caching file system I'm currently working on.
