ZFS dedupe (again): Is memory usage dependent on the physical (deduped, compressed) data stored, or on the logical data used?

memory, zfs

I've been googling this a lot, but I cannot find sufficient info on this one. The rule of thumb seems to be 5 GB of RAM per 1 TB of storage. But what does "storage" actually refer to? Physical (deduped, compressed) or logical used space?

Let's say I have a 6 TB hard drive with no dedupe and no compression, holding 6 TB of actual data. Let's assume it would dedupe 2:1, down to 3 TB of data. Would we (approximately) require 3 * 5 GB of memory, or 6 * 5 GB?

As I understand it, the requirement depends on the number of records. Since I cannot store more than 6 TB worth of actual records on the disk, about 30 GB ought to be enough regardless of the compression/deduplication ratio, depending of course on the actual record sizes?
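For a rough sanity check of my reasoning, here is a back-of-the-envelope sketch in Python. It assumes roughly 320 bytes of in-core dedup table per unique block stored, which is a commonly cited ballpark rather than an exact figure, and that the table is sized by unique blocks actually on disk; the exact per-entry size depends on the ZFS implementation:

```python
# Rough DDT memory estimate: entries are counted per unique block stored,
# so the logical data size, the recordsize and the dedup ratio drive the result.
DDT_BYTES_PER_ENTRY = 320          # commonly cited ballpark, implementation-dependent

def ddt_ram_estimate(logical_bytes, recordsize=128 * 1024,
                     dedup_ratio=1.0, entry_bytes=DDT_BYTES_PER_ENTRY):
    """Estimate the in-core dedup table size for a dataset.

    logical_bytes -- data size as seen by applications (before dedup/compression)
    dedup_ratio   -- e.g. 2.0 means only half of the blocks are unique
    """
    total_blocks = logical_bytes / recordsize
    unique_blocks = total_blocks / dedup_ratio   # only unique blocks get a DDT entry
    return unique_blocks * entry_bytes

TB = 1024 ** 4
# 6 TB of logical data, default 128K recordsize:
print(ddt_ram_estimate(6 * TB, dedup_ratio=2.0) / 1024 ** 3, "GiB")  # ~7.5 GiB at 2:1
print(ddt_ram_estimate(6 * TB, dedup_ratio=1.0) / 1024 ** 3, "GiB")  # ~15 GiB worst case
```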

The thing is, we'd like to calculate what's cheaper: replacing the 6 * 6 TB disks (3x onsite storage/mirror/hot spare, 3x offsite; we don't have more slots available in those boxes) with larger ones for backups, or buying some RAM for both boxes.

(Disclaimer: I'm not a sysadmin, but someone needed to put that hat on, so we can continue to have backups.)

Best Answer

While user121391's answer is mostly correct, the 1/4 limit for metadata no longer applies (and has not applied for a long time):

There's a limit to how much of the ZFS ARC cache can be allocated for metadata (and the dedup table falls under this category), and it is capped at 1/4 the size of the ARC

First of all, the zfs_arc_meta_limit (the amount of caching memory that may be used for metadata, including the dedup table) has always been tunable (iirc). So even in very old ZFS versions where 25% may have been the default, you could use that setting to tune the amount of cache available for metadata. In the case of a backup system, where most of the user data is rarely accessed, >= 75% for metadata and <= 25% for user data may be more appropriate. Please keep in mind that this tunable expects an absolute amount of memory in bytes, not a percentage.
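To illustrate the bytes-not-percentage point, here is a minimal sketch, assuming ZFS on Linux, where module parameters are exposed under /sys/module/zfs/parameters. The path and the runtime writability of the parameter are assumptions you should verify on your own system; writing requires root and does not persist across reboots:

```python
#!/usr/bin/env python3
# Sketch: compute a byte value for zfs_arc_meta_limit (e.g. 75% of a chosen
# ARC size) and optionally apply it at runtime on ZFS on Linux.

PARAM = "/sys/module/zfs/parameters/zfs_arc_meta_limit"  # ZoL-specific path

arc_size_bytes = 32 * 1024 ** 3          # example: a 32 GiB ARC, adjust to your box
meta_limit = int(arc_size_bytes * 0.75)  # reserve ~75% of the ARC for metadata

print(f"zfs_arc_meta_limit={meta_limit}")

# Uncomment to apply immediately (root required, not persistent across reboots;
# for a persistent setting, put "options zfs zfs_arc_meta_limit=<bytes>" into
# /etc/modprobe.d/zfs.conf instead).
# with open(PARAM, "w") as f:
#     f.write(str(meta_limit))
```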

Depending on your ZFS implementation, please also consider the following:


For ZFS in Oracle Solaris 11, the limit has long been completely removed by default:

Prior to this change being implemented, the ARC limited metadata to one quarter of memory. Whatever the rationale for this might once have been it carries now a serious adverse effect on dedup performance. Because the DDT is considered to be metadata, it is subject to the 1/4 limit. At this point, this limit is an anachronism; it can be eliminated (or rather, set to arc_c).

So while you CAN still set the limit, it is no longer recommended.


For ZFS on Linux up to 0.6.x, e.g. in Ubuntu 16.04, the default seems to be 75%:

zfs_arc_meta_limit (ulong): The maximum allowed size in bytes that meta data buffers are allowed to consume in the ARC. When this limit is reached meta data buffers will be reclaimed even if the overall arc_c_max has not been reached. This value defaults to 0 which indicates that 3/4 of the ARC may be used for meta data.

There's also a tunable if you would like to make sure a minimum amount of memory is always reserved for meta data:

zfs_arc_meta_min (ulong): The minimum allowed size in bytes that meta data buffers may consume in the ARC. This value defaults to 0 which disables a floor on the amount of the ARC devoted to meta data.

In ZFS on Linux 0.7.0, it seems like there will be a way to tune the metadata limit as a percentage (a short sketch for inspecting these parameters follows below):

zfs_arc_meta_limit_percent (ulong): Percentage of ARC buffers that can be used for meta data. See also zfs_arc_meta_limit which serves a similar purpose but has a higher priority if set to nonzero value.
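If you want to check which of these tunables exist on a given ZFS on Linux box and what they are currently set to, a minimal sketch like the following can help. The /sys path is a ZoL-specific assumption, and zfs_arc_meta_limit_percent will simply be reported as missing on versions before 0.7.0:

```python
#!/usr/bin/env python3
# Sketch: report the current ARC metadata tunables on ZFS on Linux.
from pathlib import Path

PARAM_DIR = Path("/sys/module/zfs/parameters")   # ZoL-specific location
TUNABLES = ("zfs_arc_meta_limit", "zfs_arc_meta_min", "zfs_arc_meta_limit_percent")

for name in TUNABLES:
    path = PARAM_DIR / name
    if path.exists():
        print(f"{name} = {path.read_text().strip()}")
    else:
        print(f"{name} is not available in this ZFS version")
```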


If you're planning to use a Linux-based ZFS implementation, consider simulating your use case in a virtual machine before spending lots of $$$ on hardware. I would recommend testing the worst case for dedup (= 100% random data). If you do not have the necessary virtualization resources at hand, be advised that you can always spin up insanely huge instances at most cloud providers for a couple of hours for very little money.
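As an illustration of what such a worst-case test could look like, here is a minimal sketch that fills a test dataset with incompressible, non-dedupable random data. The target path is a placeholder; adjust the size and file count to whatever your test pool can hold:

```python
#!/usr/bin/env python3
# Sketch: write purely random files to a dedup-enabled test dataset.
# Random data neither compresses nor dedupes, so it exercises the worst case
# for dedup-table growth relative to the amount of data stored.
import os

TARGET_DIR = "/testpool/dedup-test"   # placeholder: a dataset with dedup=on
FILE_SIZE = 1 * 1024 ** 3             # 1 GiB per file
NUM_FILES = 10
CHUNK = 4 * 1024 ** 2                 # write in 4 MiB chunks

os.makedirs(TARGET_DIR, exist_ok=True)
for i in range(NUM_FILES):
    with open(os.path.join(TARGET_DIR, f"random_{i:03d}.bin"), "wb") as f:
        written = 0
        while written < FILE_SIZE:
            f.write(os.urandom(CHUNK))
            written += CHUNK
```

Afterwards, zpool status -D on the test pool shows how many DDT entries were actually created and how much space they take, which you can extrapolate to your real data set.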

One last thing to consider: you can always tune the ZFS recordsize. Generally speaking, small record sizes yield better dedup ratios but obviously require more RAM for the dedup table; larger record sizes yield worse dedup ratios but require less RAM for the dedup table. E.g., while we're currently not using dedup on our ZFS backup storage, I have set the ZFS recordsize to 1M to match the block size our backup application works with.
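Setting the recordsize is a per-dataset property. A minimal sketch of how that might be scripted is below; the pool/dataset name is a placeholder, a 1M recordsize requires the large_blocks pool feature to be enabled, and the new recordsize only applies to data written after the change:

```python
#!/usr/bin/env python3
# Sketch: set a 1M recordsize on a backup dataset via the zfs CLI.
import subprocess

DATASET = "tank/backup"   # placeholder dataset name

subprocess.run(["zfs", "set", "recordsize=1M", DATASET], check=True)
subprocess.run(["zfs", "get", "recordsize", DATASET], check=True)
```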

Not sure why I just wrote a PhD thesis on the caching of ZFS metadata, but I hope it helps. :)
