Tiered storage with BTRFS – how is it done


NETGEAR uses BTRFS in its ReadyNAS OS and implements tiered storage in the latest versions. They started with a "Metadata" tier only in ReadyNAS v6.9, and added a "Data Tier" in v6.10. The system uses SSDs as Tier-0 to speed up access to the slower HDDs. According to NETGEAR's description, metadata resides on the SSDs in both modes; in the "Data Tier" mode, newly written data also goes to the SSDs first and is later migrated to the HDDs, either periodically or when the SSD tier fills up beyond a specified level.
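
As a side note, the way data and metadata chunks are spread across the member devices of a multi-device BTRFS can be inspected with stock btrfs-progs; /data below is just the mount point of the tiered volume referenced later in this post:

# Per-device breakdown of Data / Metadata / System chunk allocation
btrfs device usage /data

# Overall allocation and usage per chunk type and profile
btrfs filesystem df /data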

ReadyNAS uses BTRFS on top of RAID-ed HDDs in its normal installs – e.g. my system has a RAID5 made of 4 disks, which BTRFS sees/uses as a single device.

Looking at how the tiering is implemented, it appears that both the "Metadata" and the "Data Tier" setups are created by adding a second RAID array, made only of the SSDs, to the main HDD RAID array, and converting the initially single-device BTRFS filesystem into a multi-device one.
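
In other words, I assume the setup boils down to something like the following stock commands (my reconstruction, not NETGEAR's actual provisioning code; /dev/md126 is the HDD array and /dev/md127 the SSD array, as in the output further down):

# Add the SSD array as a second device to the mounted filesystem,
# turning the single-device BTRFS into a multi-device one
# (-f overwrites any stale filesystem signature on the SSD array)
btrfs device add -f /dev/md127 /data

# The filesystem should now list both arrays
btrfs filesystem show /data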

What I cannot figure out is how the migration is done, how the "Metadata" case manages to separate metadata from data so that only metadata goes to the SSDs, and how the "Data Tier" mode directs new writes entirely to the SSD tier.

Any ideas?

Best Answer

OK, here's what I found happening during the periodic balances:

The following process is started on the host:

btrfs balance start -dsweep lt:/dev/md127:7 /data LANG=en_US.UTF-8 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin DBUS_SESSION_BUS_ADDRESS=unix:path=/var/netatalk/spotlight.ipc TRACKER_USE_CONFIG_FILES=1 TRACKER_USE_LOG_FILES=1 XDG_DATA_HOME=/apps/.xdg/local/share XDG_CONFIG_HOME=/apps/.xdg/config XDG_CACHE_HOME=/apps/.xdg/cache

where /data is my tiered data volume and /dev/md127 is the SSD array used as buffer/cache (the trailing LANG=, PATH=, XDG_… entries are just the process's environment, captured together with the command line).
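
For reference, the full command line and environment of the running balance can be pulled from /proc like this (assuming pgrep matches only that one process; must be run as root):

# Dump the command line and the environment of the running balance
pid=$(pgrep -f 'btrfs balance start' | head -n 1)
tr '\0' ' ' < /proc/$pid/cmdline; echo
tr '\0' '\n' < /proc/$pid/environ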

This process runs until the data from the SSD tier is moved almost completely to the HDD tier - e.g. somewhere along the way I see:

btrfs fi sh /data
Label: '0a44c6bc:data'  uuid: ed150b8f-c986-46d0-ada8-45ee219acbac
    Total devices 2 FS bytes used 393.14GiB
    devid    1 size 7.12TiB used 359.00GiB path /dev/md126
    devid    2 size 114.68GiB used 42.06GiB path /dev/md127

and then it keeps going down until the usage of the SSD tier is almost zero. The strange thing is that so far I have not been able to run this command manually.
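
For comparison: stock btrfs-progs has no 'sweep' filter, but a similar "drain the SSD tier" pass can be approximated with the standard devid filter, which restricts the balance to chunks that have a stripe on the given device; the rewritten chunks are then placed by the normal allocator, which on this layout prefers the much larger HDD array. That is only my guess at roughly what the vendor filter does, not its actual implementation:

# devid 2 is the SSD array (/dev/md127) in the 'btrfs fi show' output above.
# Rewrite only the chunks currently living on it; the allocator puts the
# new copies on the device with the most free space, i.e. the HDD array.
btrfs balance start -ddevid=2 /data

# Watch the per-device usage drop while it runs
btrfs balance status /data
btrfs filesystem show /data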

I still cannot figure out the 'sweep' balance filter.
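
'sweep' is not one of the upstream balance filters (stock btrfs-progs only knows profiles, usage, devid, drange, vrange, convert, limit, stripes and soft), so it looks like a NETGEAR patch to both btrfs-progs and the kernel. A crude way to confirm that the installed userspace tool is patched:

# If 'sweep' shows up inside the installed binary, userspace is patched
# (grep -a treats the binary as text, so no 'strings' tool is needed)
grep -ac sweep "$(command -v btrfs)"
btrfs version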

This is what the --help shows:

# btrfs balance start --help
usage: btrfs balance start [options] <path>

    Balance chunks across the devices

    Balance and/or convert (change allocation profile of) chunks that
    passed all filters in a comma-separated list of filters for a
    particular chunk type.  If filter list is not given balance all
    chunks of that type.  In case none of the -d, -m or -s options is
    given balance all chunks in a filesystem. This is potentially
    long operation and the user is warned before this start, with
    a delay to stop it.

    -d[filters]    act on data chunks
    -m[filters]    act on metadata chunks
    -s[filters]    act on system chunks (only under -f)
    -v             be verbose
    -f             force reducing of metadata integrity
    --full-balance do not print warning and do not delay start
    --background|--bg
                   run the balance as a background process

but it does not document the "sweep" filter or explain how it maps to the "lt:/dev/md127:7" argument in the command that runs periodically:

btrfs balance start -dsweep lt:/dev/md127:7 /data

What's the meaning here: run the balance until the data usage on /dev/md127 falls below 7%?
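
If that reading is right (it is only a guess from the shape of the argument: "lt" = less than, then the device, then a percentage), the stop condition can at least be watched from the outside; the awk below assumes the --raw output format of btrfs filesystem show:

# Print how much of /dev/md127's space is currently allocated, in percent;
# the sweep balance would presumably stop once this drops below 7
btrfs filesystem show --raw /data | awk '
    $NF == "/dev/md127" {
        for (i = 1; i < NF; i++) {
            if ($i == "size") size = $(i + 1)
            if ($i == "used") used = $(i + 1)
        }
        printf "allocated: %.1f%%\n", 100 * used / size
    }'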