I've been googling this a lot, but I can't find solid information on this one. The rule of thumb seems to be 5 GB of RAM per 1 TB of storage. But what does "storage" actually mean here — physical capacity, or logical data used?
Let's say I have a 6 TB hard drive with no dedupe and no compression, holding 6 TB of actual data. Let's assume it would dedupe 2:1, down to 3 TB of data. Would we (approximately) require 3 × 5 GB of memory, or 6 × 5 GB?
As I understand it, the requirement depends on the number of records. Since I cannot store more than 6 TB worth of records on the disk, about 30 GB ought to be enough regardless of the compression/deduplication ratio — depending, of course, on the actual record sizes?
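To make that concrete, here's the back-of-the-envelope math I'm doing. The ~320 bytes of RAM per dedup-table entry is a figure I've seen cited for ZFS, so treat it (and the function name) as an assumption for illustration, not an exact number for every implementation:

```python
# Rough dedup-table (DDT) RAM estimate -- a sketch, assuming ~320 bytes
# of RAM per DDT entry and one entry per unique block on disk.

def ddt_ram_bytes(unique_data_bytes, recordsize=128 * 1024, bytes_per_entry=320):
    """Approximate RAM needed to hold the dedup table in memory."""
    unique_blocks = unique_data_bytes // recordsize
    return unique_blocks * bytes_per_entry

TIB = 1024 ** 4
GIB = 1024 ** 3

# 6 TiB of data deduping 2:1 -> ~3 TiB of unique blocks at 128K recordsize
print(ddt_ram_bytes(3 * TIB) / GIB)  # prints 7.5

# Worst case, no dedup at all: every block on the 6 TiB disk is unique
print(ddt_ram_bytes(6 * TIB) / GIB)  # prints 15.0
```

So with the default 128K recordsize, even the worst case stays well under the 30 GB the rule of thumb suggests — the rule of thumb apparently builds in headroom (and smaller recordsizes inflate the table fast).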
The thing is that we'd like to calculate what's cheaper: replacing the 6×6 TB disks (3× on-site storage/mirror/hot spare, 3× off-site; we don't have more slots available in those boxes) with larger ones for backups, or buying some RAM for both boxes.
(Disclaimer: I'm not a sysadmin, but someone needed to put that hat on, so we can continue to have backups.)
Best Answer
While user121391's answer is mostly correct, the 1/4 limit for metadata no longer applies (and hasn't for a long time):
First of all, zfs_arc_meta_limit (the amount of cache memory that may be used for metadata, including the dedup table) has always been tunable (iirc). So even in very old ZFS versions where 25% may have been the default, you could use that setting to tune the amount of cache available for metadata. In the case of a backup system, where most of the user data is rarely accessed, >= 75% for metadata and <= 25% for user data may be more appropriate. Keep in mind that this tunable takes an absolute amount of memory in bytes, not a percentage.
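On ZFS on Linux this is exposed as a module parameter of the same name. The 24 GiB value below is just an example; size it for your own box:

```shell
# /etc/modprobe.d/zfs.conf -- allow up to 24 GiB of ARC for metadata.
# Note the value is in bytes (24 * 1024^3), not a percentage.
options zfs zfs_arc_meta_limit=25769803776
```

You can also change it at runtime by writing the same byte value to `/sys/module/zfs/parameters/zfs_arc_meta_limit`, without reloading the module.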
Depending on your ZFS implementation, please also consider the following:
For ZFS in Oracle Solaris 11, the limit has long been removed entirely by default. So while you CAN still set it, doing so is no longer recommended.
For ZFS on Linux up to 0.6.x (e.g. in Ubuntu 16.04), the default seems to be 75%.
There's also a tunable if you want to make sure a minimum amount of memory is always reserved for metadata.
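The tunable I mean appears to be zfs_arc_meta_min (again a byte count); the 8 GiB figure here is only an example:

```shell
# /etc/modprobe.d/zfs.conf -- always keep at least 8 GiB of ARC
# reserved for metadata (value is in bytes: 8 * 1024^3).
options zfs zfs_arc_meta_min=8589934592
```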
In ZFS on Linux 0.7.0, it looks like there will be a way to set the metadata limit as a percentage instead of a byte count.
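If I read the 0.7.0 changes correctly, the new parameter is zfs_arc_meta_limit_percent, defaulting to 75:

```shell
# /etc/modprobe.d/zfs.conf (ZFS on Linux 0.7.0+) -- cap metadata at a
# percentage of the ARC instead of an absolute byte count.
options zfs zfs_arc_meta_limit_percent=75
```

A percentage is easier to keep sensible if you later add RAM to the box, since you don't have to recalculate a byte value.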
If you're planning to use a Linux-based ZFS implementation, consider simulating your use case in a virtual machine before spending lots of $$$ on hardware. I would recommend testing the worst case for dedup (100% random data). If you don't have the necessary virtualization resources at hand, keep in mind that you can always spin up insanely huge instances at most cloud providers for a couple of hours for very little money.
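For the simulation itself, zdb can estimate what dedup *would* achieve on an existing pool without actually enabling it. The pool and path names below are hypothetical:

```shell
# Simulate dedup on an existing pool: walks the pool and prints a
# dedup-table histogram plus the ratio you could expect.
zdb -S tank

# Worst case for the DDT: fill a test dataset with incompressible,
# non-deduplicatable random data, then re-run the simulation.
dd if=/dev/urandom of=/tank/test/random.bin bs=1M count=100000
```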
One last thing to consider: you can always tune the ZFS recordsize. Generally speaking, small record sizes yield better dedup ratios but obviously require more RAM for the dedup table; larger record sizes yield worse dedup ratios but require less RAM. For example, while we're not currently using dedup on our ZFS backup storage, I've set the ZFS recordsize to 1M to match the block size our backup application works with.
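Setting and checking this is a one-liner per dataset (dataset name below is hypothetical; note that recordsize only affects newly written data, and values above 128K require the large_blocks pool feature):

```shell
# Match the backup application's 1M block size; applies to new writes only.
zfs set recordsize=1M tank/backups
zfs get recordsize tank/backups
```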
Not sure why I just wrote a PhD thesis on the caching of ZFS metadata, but I hope it helps. :)