Linux – Combining SSD + HDD into single fast, large partition

linux zfs

So I have a desktop with a fast SSD and a large HDD. I am trying to get a well-configured, large, fast zpool out of the pair.

I have read that I can carve out separate partitions on the SSD for the ZIL and L2ARC, which would seem to do what I want, except that I have to manually decide how big each partition should be. What I don't like about this is that it's somewhat involved and potentially hard to reconfigure if I need to change the partitions, and it sounds like the maximum filesystem size is limited by the HDD alone, since the intent is that everything on the ZIL and L2ARC eventually has to make it to disk anyway. It's also not clear whether the L2ARC is retained after a system reboot or has to be populated again. Finally, it seems inefficient to have to copy data from the ZIL to the L2ARC if both are on the same SSD, or even to the HDD when there is currently no pressure on how much hot data I need on the SSD.

Alternatively, it seems I can also just have one partition on the SSD and one on the HDD and add both to a zpool directly, with no redundancy. I have tried this, and noticed sustained read/write speeds greater than what the HDD alone can muster. But I don't know whether everything is just going to the SSD for now, and everything will go to the HDD later once the SSD has filled up. Ideally, I would like ZFS to transparently shuffle data around behind the scenes so that the hot data always stays on the SSD, similarly to what the L2ARC does, while keeping a sensible amount of empty space on the SSD for new writes. The ZIL should be automatically managed to be the right size and preferably live on the SSD as much as possible.
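For reference, the no-redundancy setup described above boils down to a single command; the pool name and device paths below are hypothetical placeholders:

```shell
# One SSD partition + one HDD partition striped into a single pool,
# with no redundancy ("tank" and the device paths are placeholders):
zpool create tank /dev/disk/by-id/ssd-part1 /dev/disk/by-id/hdd-part1
```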
If I go the manually configured ZIL + L2ARC route, it seems the ZIL only needs to be about (10 sec × HDD write speed) big. Keeping the ZIL that small maximizes the size of the L2ARC, which is good. But what happens if I add a second striped disk, which effectively doubles my HDD speed (and capacity)?
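A quick back-of-the-envelope for that sizing rule, assuming a hypothetical sustained HDD write speed of 200 MB/s (the actual number depends on your drive):

```shell
# ZIL/SLOG sizing sketch: the log only needs to absorb the writes that
# accumulate between transaction-group commits (~10 seconds here).
hdd_write_mb_per_s=200   # assumed HDD write speed -- measure your own
txg_window_s=10
slog_mb=$((hdd_write_mb_per_s * txg_window_s))
echo "${slog_mb} MB"     # prints: 2000 MB
```

By the same arithmetic, striping in a second HDD that doubles write throughput would double the suggested size to roughly 4 GB.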

Summary of questions if using SSD for ZIL + L2ARC:

  1. If I set up SSD for ZIL + L2ARC, how hard is it to re-set it up with different partition sizes?
  2. If I use SSD for L2ARC, is its capacity included in total available pool capacity, or is the pool capacity limited by HDD alone?
  3. Is L2ARC retained after system reboot, or does it have to be re-populated?
  4. Does data have to be copied from ZIL to L2ARC even if both are on same physical SSD?
  5. If the ZIL is on the SSD and there is still plenty of room for more intents to be logged, does the ZIL still automatically get flushed to the HDD? If so, when and under what circumstances?
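On question 1: resizing is relatively painless, because log and cache vdevs can be added to and removed from a live pool. A sketch, with a hypothetical pool name and device paths:

```shell
# Attach SSD partitions to an existing pool as SLOG and L2ARC
# ("tank" and the device paths are placeholders):
zpool add tank log   /dev/disk/by-id/ssd-part1
zpool add tank cache /dev/disk/by-id/ssd-part2

# Both can be detached again at any time, so re-sizing is just
# remove -> repartition -> re-add, with no pool rebuild:
zpool remove tank /dev/disk/by-id/ssd-part1
zpool remove tank /dev/disk/by-id/ssd-part2
```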

Summary of questions if using SSD + HDD in a single zpool:

  1. ZFS obviously notices the difference in size between the SSD and HDD partitions, but does it also recognize their relative performance? In particular:
  2. How are writes distributed across the SSD and HDD when both are relatively empty?
  3. Does ZFS try to do anything smart with data shuffling once the SSD part of the zpool fills up? In particular:
  4. If the SSD part of the zpool is filled up, does ZFS ever anticipate that more writes are coming and try to move data from SSD to HDD in the background?
  5. If the SSD part of the zpool is filled up, and I start accessing a bunch of data off the HDD and not so much off the SSD, does ZFS make any effort to swap the hot data onto the SSD?
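For questions 2–5, one way to see what ZFS is actually doing is to watch per-vdev statistics while the pool is under load (the pool name here is hypothetical):

```shell
# Per-vdev I/O statistics, refreshed every 5 seconds, showing how reads
# and writes are split between the SSD and HDD vdevs:
zpool iostat -v tank 5

# Per-vdev capacity and allocation:
zpool list -v tank
```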

Finally, the most important question:

  1. Is it a good idea to set up SSD + HDD in same pool, or is there a better way to optimize my pair of drives for both speed and capacity?

Best Answer

While Marco's answer explained all the details correctly, I just want to focus on your last question/summary:

Is it a good idea to set up SSD + HDD in same pool, or is there a better way to optimize my pair of drives for both speed and capacity?

ZFS is a file system designed for large arrays with many smaller disks. Although it is quite flexible, I think it is suboptimal for your current situation and goal, for the following reasons:

  • ZFS does no reshuffling of already written data. What you are looking for is called a hybrid drive; for example, Apple's Fusion Drive lets you fuse multiple disks together and automatically selects the storage location for every block based on access history (data is moved when the system is idle, or on rewrite). With ZFS you have none of that, neither automatically nor manually: your data stays where it was initially written (or is already marked for deletion).
  • With just a single disk, you give up redundancy and self-healing. You can still detect errors, but you do not use the full capabilities of the system.
  • Putting both disks in the same pool means an even higher chance of data loss (this is effectively RAID0, after all) or corruption; additionally, your performance will be subpar because of the differing drive sizes and speeds.
  • HDD + SLOG + L2ARC is a bit better, but you need a very good SSD (better two different ones, like Marco said, though a single NVMe SSD is a good but expensive compromise), and most of the space on it is wasted: 2 to 4 GB is enough for the ZIL, and a large L2ARC only helps once your RAM is full, yet itself consumes additional RAM. This leads to a sort of catch-22: if you want to use an L2ARC, you need more RAM, but with that much RAM you can often just use the RAM itself, because it is enough. Remember, only blocks are cached, so you do not need as much space as you would assume by looking at plain files.

Now, what are the alternatives?

  • You could split into two pools, one for the system and one for data. This way you have no automatic rebalancing and no redundancy, but you get a clean system that can be extended easily and has no RAID0 problems.
  • Buy a second large HDD, make a mirror, and use the SSD as you outlined: this removes the problem of differently sized and differently fast disks, gives you redundancy, and keeps the SSD flexible.
  • Buy n SSDs and do RAIDZ1/2/3. Smaller SSDs are pretty cheap nowadays and do not suffer from slow rebuild times, making RAIDZ1 interesting again.
  • Use another file system or volume manager with hybrid capabilities, with ZFS on top if needed. This is not seen as optimal, but neither is working with two single-disk vdevs in a pool... at least you get exactly what you want, plus some of the nice things of ZFS (snapshots etc.) on top, though I wouldn't count on stellar performance.
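The first alternative (two separate pools) is the simplest to sketch; the pool names and device paths below are hypothetical:

```shell
# One pool per drive: the SSD pool holds the system and hot data,
# the HDD pool holds bulk data ("fastpool"/"datapool" are placeholders):
zpool create fastpool /dev/disk/by-id/ssd-part1
zpool create datapool /dev/disk/by-id/hdd-part1
```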