Linux – Unable to zfs send | zfs receive datasets in same zpool

linuxsnapshotzfs

I have two environments Staging (staging) and Development (choang). I am unable to replicate (including snapshots) from dataset zfs/staging.assets to another dataset zfs/choang.assets within the same zpool zfs.

NOTE: I am assuming I need to unmount both origin and destination datasets.

zfs unmount zfs/staging.assets
zfs unmount zfs/choang.assets

zfs send -R zfs/staging.assets | zfs receive -F zfs/choang.assets

When executed, the above commands generate the following error:

Error: Unsupported flag with filesystem or bookmark.
cannot receive: failed to read from stream

When I remove the -R option and execute the commands, it succeeds:

zfs send zfs/staging.assets | zfs receive -F  zfs/choang.assets

However, no snapshots are received and a single snapshot zfs/choang.assets@--head-- is created.

Finally, I attempted to send a snapshot– thinking maybe I might send a snapshot at a time:

zfs send zfs/staging.assets@sha512_hash | zfs receive -Fduv zfs/choang.assets

This didn't work either and generated the following error:

internal error: Invalid argument
cannot receive: failed to read from stream

How can I replicate all snapshots?

Best Answer

Several things are important here. Your errors stem from combinations of them:

Normally you send a specific snapshot or several snapshots instead of a whole file system. This means you do not need to unmount the datasets and disrupt your users, and you can incrementally send/recv later on.
If you do not specify a snapshot on the source, you will get the automatically generated snapshot @--head--, which is the state of your source at the moment of sending (if you would have sent an existing snapshot, that snapshot would take the place of @--head-- on the destination side).

The send -R | recv -F combination means full replication (recursive and including properties on source, destroy old stuff on destination), so you need to decide how to expand the file system hierarchy: you can use either -e, -d, or no flag on receiving (no flag means merging the contents under the new dataset without preserving the name of the parent dataset on the source side):

The -d and -e options cause the file system name of the target snapshot
to be determined by appending a portion of the sent snapshot's name to
the specified target filesystem.  If the -d option is specified, all
but the first element of the sent snapshot's file system path (usually
the pool name) is used and any required intermediate file systems
within the specified one are created. If the -e option is specified,
then only the last element of the sent snapshot's file system name
(i.e. the name of the source file system itself) is used as the target
file system name.

Your last idea (single send, full receive) should work (I tested it in a simple environment and it did work), but it would not be what you wanted anyway.

So, to sum it up and apply to your specific situation:

First recursively create a current snapshot or choose an older one that contains all the older stuff you want to replicate):
```
zfs snapshot -r zfs/staging.assets@now
```
Destroy any old snapshots on the destination side which are on the source side (show all snapshots with zfs list -Hr -o name -t snap zfs/choang.assets or take the hints from the error message). Alternatively destroy the destination dataset and recreate it if it does not contain anything of importance.
Send recursively and fully receive, destroying all old datasets on the second dataset, merging the sub-datasets into the destination so that they mirror the source:
```
zfs send -R zfs/staging.assets@now | zfs recv -Fu zfs/choang.assets
```

Related Solutions

Freebsd – way to create cow-copies in ZFS

I think option 3 as you have described above is probably your best bet. The biggest problem with what you want is that ZFS really only handles this copy-on-write at the dataset/snapshot level.

I would strongly suggest avoiding using dedup unless you have verified that it works well with your exact environment. I have personal experience with dedup working great until one more user or VM store is moved in, and then it falls off a performance cliff and causes a lot of problems. Just because it looks like it's working great with your first ten users, your machine might fall over when you add the eleventh (or twelfth, or thirteenth, or whatever). If you want to go this route, make absolutely sure that you have a test environment that exactly mimics your production environment and that it works well in that environment.

Back to option 3, you'll need to set up a specific data set to hold each of the file system trees that you want to manage in this way. Once you've got it set up and initially populated, take your snapshots (one per dataset that will differ slightly) and promote then into clones. Never touch the original dataset again.

Yes, this solution has problems. I'm not saying it doesn't, but given the restrictions of ZFS, it's still probably the best one. I did find this reference to someone using clones effectively: http://thegreyblog.blogspot.com/2009/05/sparing-disk-space-with-zfs-clones.html

I'm not real familiar with btrfs, but if it supports the options that you want, have you considered setting up a separate server just to support these datasets, using Linux and btrfs on that server?

Linux – Does ZFS for Linux over stress VirtualBox

These look like generic hdd timeout errors in the guest system. They might be caused by ZFS, but they might just as well be caused by other high i/o operations. As a guest system, Linux is quite sensitive in this regard, as it has a low default timeout (usually 30 seconds). This may not be enough in a vm, especially if the disk image is a regular file and the host system is under load; some writes could take longer than expected if the host's cache is full.

Or, to quote the VirtualBox manual:

However, some guests (e.g. some Linux versions) have severe problems if a write to an image file takes longer than about 15 seconds. Some file systems however require more than a minute to complete a single write, if the host cache contains a large amount of data that needs to be written.

Note that this is not limited to VirtualBox. Other virtualization solutions may show the same behavior when running a Linux guest.

As for the timeout itself: The Linux hdd timeout (leading to ata exceptions and possibly corruption under high load) can be increased in the guest system.

For example, on Debian 7, all you need to do is add a few lines to your /etc/rc.local:

$ cat /etc/rc.local 
#!/bin/sh -e
#
# rc.local
#
# This script is executed at the end of each multiuser runlevel.
# Make sure that the script will "exit 0" on success or any other
# value on error.
#
# In order to enable or disable this script just change the execution
# bits.
#
# By default this script does nothing.

TIMEOUT=86400
for f in /sys/block/sd?/device/timeout; do
    echo $TIMEOUT >"$f"
done

exit 0

Then grep for ata exceptions to see if they're gone:

# grep -Rn --col 'ata.*exception' /var/log/

However, it would be preferable to increase the vm's disk performance rather than having to change the timeout of the guest system. In the case of VirtualBox, the "Host I/O Cache" of the vm's virtual storage controller can be disabled. If enabled, the host cache could be the bottleneck and slow disk operations down if there's a lot of disk i/o on the host. On the other hand, disabling it might increase the load on the vm itself so timeouts might still occur if the guest is overloaded, so enabling the host cache might even be better in some cases, depending on your workload.

If this does not help, the VirtualBox manual also recommends experimenting with the flush interval:

For IDE disks use the following command:

VBoxManage setextradata "VM name"
  "VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/FlushInterval" [b]

For SATA disks use the following command:

VBoxManage setextradata "VM name"
  "VBoxInternal/Devices/ahci/0/LUN#[x]/Config/FlushInterval" [b]

Values between 1000000 and 10000000 (1 to 10 megabytes) are a good starting point. Decreasing the interval both decreases the probability of the problem and the write performance of the guest.

In some tests, VirtualBox guest systems have experienced such hdd timeouts (crashing the vm and/or causing corruption) no matter if host i/o caching was enabled or not. The host filesystem was not slow, except for half a minute whenever a scheduled cron job would run, causing those timeouts in the vm. It was only after setting the hdd timeout as described above that the issue went away and no more timeouts happened.

Best Answer

Related Solutions

Freebsd – way to create cow-copies in ZFS

Linux – Does ZFS for Linux over stress VirtualBox

Related Question