I think option 3 as you have described above is probably your best bet. The biggest problem with what you want is that ZFS really only handles this copy-on-write at the dataset/snapshot level.
I would strongly suggest avoiding dedup unless you have verified that it works well in your exact environment. I have personal experience with dedup working great until one more user or VM store is moved in, at which point it falls off a performance cliff and causes a lot of problems. Just because it looks like it's working great with your first ten users doesn't mean the machine won't fall over when you add the eleventh (or twelfth, or thirteenth, or whatever). If you want to go this route, make absolutely sure you have a test environment that exactly mimics your production environment, and that it works well in that environment.
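If you do want to evaluate dedup, you can at least estimate its cost in advance before enabling it. A sketch (the pool name tank is a placeholder for your pool):

```
# Simulate deduplication on an existing pool WITHOUT enabling it;
# zdb builds a dedup table in memory and prints a histogram plus
# an estimated dedup ratio and DDT size:
zdb -S tank
```

The key number to check is the projected dedup table size: if it does not comfortably fit in RAM (ARC), dedup performance will collapse exactly the way described above.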
Back to option 3: you'll need to set up a specific dataset to hold each of the file system trees that you want to manage this way. Once you've got it set up and initially populated, take your snapshots (one per dataset that will differ slightly) and create clones from them. Never touch the original dataset again.
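The snapshot-and-clone setup might look roughly like this (pool, dataset, and snapshot names are made up for illustration):

```
# One dataset to hold the shared file system tree:
zfs create tank/base
# ... populate tank/base, then freeze it with a snapshot:
zfs snapshot tank/base@golden
# One clone per variant; each clone initially shares all blocks
# with the snapshot and only stores blocks that later diverge:
zfs clone tank/base@golden tank/user1
zfs clone tank/base@golden tank/user2
```

From here on, all changes go into the clones; the original dataset and its snapshot stay untouched so the clones keep sharing their common blocks.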
Yes, this solution has problems. I'm not saying it doesn't, but given the restrictions of ZFS, it's still probably the best one. I did find this reference to someone using clones effectively: http://thegreyblog.blogspot.com/2009/05/sparing-disk-space-with-zfs-clones.html
I'm not really familiar with btrfs, but if it supports the options you want, have you considered setting up a separate server just to host these datasets, running Linux and btrfs on that server?
These look like generic HDD timeout errors in the guest system. They might be caused by ZFS, but they might just as well be caused by other high-I/O operations. As a guest system, Linux is quite sensitive in this regard, as it has a low default timeout (usually 30 seconds). This may not be enough in a VM, especially if the disk image is a regular file and the host system is under load; some writes could take longer than expected if the host's cache is full.
Or, to quote the VirtualBox manual:
However, some guests (e.g. some Linux versions) have severe problems
if a write to an image file takes longer than about 15 seconds. Some
file systems however require more than a minute to complete a single
write, if the host cache contains a large amount of data that needs to
be written.
Note that this is not limited to VirtualBox. Other virtualization solutions may show the same behavior when running a Linux guest.
As for the timeout itself: The Linux hdd timeout (leading to ata exceptions and possibly corruption under high load) can be increased in the guest system.
For example, on Debian 7, all you need to do is add a few lines to your /etc/rc.local:
$ cat /etc/rc.local
#!/bin/sh -e
#
# rc.local
#
# This script is executed at the end of each multiuser runlevel.
# Make sure that the script will "exit 0" on success or any other
# value on error.
#
# In order to enable or disable this script just change the execution
# bits.
#
# By default this script does nothing.
TIMEOUT=86400
for f in /sys/block/sd?/device/timeout; do
    echo "$TIMEOUT" > "$f"
done
exit 0
Then grep for ata exceptions to see if they're gone:
# grep -Rn --color 'ata.*exception' /var/log/
However, it would be preferable to increase the VM's disk performance rather than changing the timeout of the guest system. In the case of VirtualBox, the "Host I/O Cache" of the VM's virtual storage controller can be disabled; if enabled, the host cache can become the bottleneck and slow disk operations down when there is a lot of disk I/O on the host. On the other hand, disabling it may increase the load on the VM itself, so timeouts could still occur if the guest is overloaded. Depending on your workload, enabling the host cache might therefore even be the better choice.
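Toggling the host I/O cache can be done per storage controller with VBoxManage, e.g. (the VM and controller names are examples; the VM must be powered off first):

```
# Disable the host I/O cache on the controller named "SATA":
VBoxManage storagectl "VM name" --name "SATA" --hostiocache off

# Re-enable it if your workload turns out to benefit from host-side caching:
VBoxManage storagectl "VM name" --name "SATA" --hostiocache on
```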
If this does not help, the VirtualBox manual also recommends experimenting with the flush interval:
For IDE disks use the following command:
VBoxManage setextradata "VM name"
"VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/FlushInterval" [b]
For SATA disks use the following command:
VBoxManage setextradata "VM name"
"VBoxInternal/Devices/ahci/0/LUN#[x]/Config/FlushInterval" [b]
Values between 1000000 and 10000000 (1 to 10 megabytes) are a good
starting point. Decreasing the interval both decreases the probability
of the problem and the write performance of the guest.
In some tests, VirtualBox guest systems experienced such HDD timeouts (crashing the VM and/or causing corruption) regardless of whether host I/O caching was enabled. The host filesystem was not slow, except for about half a minute whenever a scheduled cron job ran, and that was enough to trigger those timeouts in the VM. Only after raising the HDD timeout as described above did the issue go away, with no more timeouts since.
Best Answer
Several things are important here. Your errors stem from combinations of them:

- send/recv operates on snapshots. If you do not specify one, a temporary snapshot is created for the transfer and cleaned up later on. It shows up as @--head--, which is the state of your source at the moment of sending (if you had sent an existing snapshot, that snapshot would take the place of @--head-- on the destination side).
- The send -R | recv -F combination means full replication (recursive, including properties on the source, destroying old stuff on the destination), so you need to decide how to expand the file system hierarchy: you can use either -e, -d, or no flag on receiving (no flag means merging the contents under the new dataset without preserving the name of the parent dataset on the source side).

So, to sum it up and apply it to your specific situation:
1. First, recursively create a current snapshot, or choose an older one that contains all the older stuff you want to replicate.
2. Destroy any old snapshots on the destination side that are not on the source side (show all snapshots with zfs list -Hr -o name -t snap zfs/choang.assets, or take the hints from the error message). Alternatively, destroy the destination dataset and recreate it if it does not contain anything of importance.
3. Send recursively and fully receive, destroying all old datasets on the destination and merging the sub-datasets into it so that they mirror the source.
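Put into concrete commands, the steps might look roughly like this (the pool and dataset names below are examples, not your actual layout):

```
# 1) Recursive snapshot of the source:
zfs snapshot -r sourcepool/data@transfer

# 2) Check the destination for stale snapshots and destroy any strays:
zfs list -Hr -o name -t snap destpool/data
zfs destroy destpool/data@somestalesnapshot

# 3) Full recursive replication; pick the receive flag that gives
#    the naming you want:
#      -d      : keep the full source path below the pool name
#      -e      : keep only the last path element of the source
#      no flag : merge under exactly the dataset name you give
zfs send -R sourcepool/data@transfer | zfs recv -F -d destpool
```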