Linux – Recover/import ZFS pool from single old mirror

data-recovery linux zfs

I made a couple of serious errors dealing with my ZFS pools the other day (while misreading some online advice on fixing an error) and accidentally "created over" an existing 2-drive mirrored pool named backup. (Yes, I used the -f option after it complained. And now I know never to do that again.)

In any case, I happen to have taken out a 3rd mirrored drive from the same pool a few months back, as it was getting old and I didn't want to wait for it to start to fail. So, I thought that I could swap this drive in and use it to restore the pool. (I'd just be missing out on the past few months of backups, which is mostly what this pool is used for.)

However, I don't seem to be able to import the pool with this single old drive. At first, I thought it might have had to do with the name conflict with the new backup pool I accidentally created (and then destroyed). But even when trying to import via GUID, I get nothing.

Here's the output from zdb -l /dev/sdb1 (which is the third drive)

------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'backup'
    state: 0
    txg: 0
    pool_guid: 3936176493905234028
    errata: 0
    hostid: 8323329
    hostname: [omitted]
    top_guid: 14695910886267065742
    guid: 17986383713788026938
    vdev_children: 1
    vdev_tree:
        type: 'mirror'
        id: 0
        guid: 14695910886267065742
        whole_disk: 0
        metaslab_array: 34
        metaslab_shift: 33
        ashift: 12
        asize: 1000197324800
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 17914838236907067293
            path: '/dev/sdd1'
            whole_disk: 0
            DTL: 143
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 17986383713788026938
            path: '/dev/sdb1'
            whole_disk: 0
            DTL: 141
        children[2]:
            type: 'disk'
            id: 2
            guid: 1683783279473519399
            path: '/dev/sdc1'
            whole_disk: 0
            DTL: 145
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
    create_txg: 0
    labels = 0 1 2 3 

Thus, the label and pool data on the drive seem to be intact, according to zdb. However, importing the pool (even with -f and/or -F) just gets a "cannot import… no such pool available" error. I tried using the various GUIDs from the above info too (since I wasn't sure which GUID was the relevant one), but none of those commands (e.g., zpool import 3936176493905234028) gets anything other than the "no such pool available" message.
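
For concreteness, the failing attempts were variations on the following (with and without the force/recovery flags):

    zpool import backup                    # by pool name
    zpool import -f backup                 # force, in case the pool looks in use
    zpool import -f 3936176493905234028    # by the pool GUID reported by zdb
    zpool import -fF 3936176493905234028   # -F attempts recovery/rewind mode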

I have installed a new version of my Linux OS since I removed that drive, so I thought using the old zpool.cache file I managed to recover from the old OS might do something. But the command zpool import -c zpool.cache just gives:

  pool: backup
     id: 3936176493905234028
  state: UNAVAIL
 status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
   see: http://zfsonlinux.org/msg/ZFS-8000-5E
 config:

    backup      UNAVAIL  insufficient replicas
      mirror-0  UNAVAIL  insufficient replicas
        sdd1    FAULTED  corrupted data
        sdc1    FAULTED  corrupted data

That's somewhat to be expected: those are the two disks where the pool was overwritten by my create command. However, sdb1 isn't listed as a potential drive there, probably because I removed it from the pool after I took the disk out. Nevertheless, I think I have an intact copy of the old mirrored data on sdb1, and zdb agrees. Why won't it import?

Any suggestions on what else to try? Other diagnostic commands to run?


Note: I tried asking about this over at Server Fault (see link for more details about my situation), but I didn't get any feedback and realized the specific Linux implementation may be important in figuring out how to resolve this. I would sincerely appreciate any advice or suggestions.


UPDATE: I think I may have found the problem. I thought that I had removed the spare drive before I had issued a detach command. And the fact that I was still seeing label information (when other online sources seem to indicate a detach destroys the pool metadata) seemed to confirm that. I note that I'm able to simply type zdb -l backup and get label info (and get uberblock info with -u), so ZFS seems to see the pool even without my explicitly pointing it at the device. It just doesn't want to import it for some reason.

However, I'm no longer certain about the detach status. I came upon this old thread about recovering a ZFS pool from a detached mirror, and it makes a cryptic reference to txg having a value of zero. There are also references elsewhere to uberblocks being zeroed out upon a detach.

Well, the uberblock for my backup pool does list txg = 0 (while an active zpool I have elsewhere has large numbers in this field, not zero). And while there is an existing uberblock, there's only one, with the others on backup listed as "invalid." Unfortunately, I can't find much documentation of zdb's output readily available online.
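
For reference, the label and uberblock info above came from zdb invocations along these lines (per the zdb man page, adding -u to -l also dumps the uberblocks stored in the labels):

    # Labels plus uberblocks read directly from the old mirror member:
    zdb -lu /dev/sdb1

    # zdb also answers by pool name here, even though the pool won't import:
    zdb -l backup    # label info
    zdb -u backup    # uberblock info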

I assume that means the spare third drive was detached? Can anyone confirm my interpretation? And if the drive data is otherwise intact, is there any way to recover from it? While some advice online suggests a detached mirror is unrecoverable without resilvering, the thread I linked above has code for Solaris that performs a rather simple fix-up to trick the label into thinking the uberblock is fine. Further poking around turned up an updated Solaris version of this utility from only three years ago.

Assuming my understanding is correct and that my third mirror was detached, can I attempt a similar uberblock label fix in Linux? Is my only option to attempt to rewrite the Solaris code so that it ports to Linux? (I'm not sure I'm up to that.)

Honestly, given multiple references to scenarios like this online, I'm surprised at the lack of reasonable data recovery tools for ZFS. It seems there are finally some options for basic data recovery for common problems (including a possibility for recovering a pool that was written over by a create command; this doesn't appear to be likely to work for me), but other than this one-off script for Solaris, I don't see anything for dealing with detached devices. It's very frustrating to realize that there are at least a dozen reasons why ZFS pools may fail to import (sometimes for trivial things that could be easily recoverable), and little in the way of troubleshooting, proper error codes, or documentation.

Again, any help, thoughts, or suggestions would be appreciated. Even if someone could recommend a better place to ask about this, I'd really appreciate it.

UPDATE2: It's also possible the device was merely taken offline rather than detached, which is what I think happened. I've read various threads saying that offline devices can end up unimportable as single mirrors too. And as the metadata and zdb output are poorly documented for ZFS, I really don't know how to determine what the uberblock and label data mean without reading through thousands of lines of source code.

Best Answer

Well, I'm close and think I've found a path to recovery. Since I haven't received advice from others, I'll post what I've learned so far.

Summary:

  • There is an unmaintained, not-officially-supported labelfix utility for fixing labels on certain kinds of corrupted (and offline/detached) ZFS volumes, which can be used to make an unimportable pool importable.
  • Before doing anything, be sure to clone the old spare devices and only work on the clones (see the ddrescue sketch after this list).
  • If you have a situation like the one described in the question, with two pools of the same name (due to a mistaken create or other error), make sure that only the devices belonging to the specific pool you want to recover are plugged in.
  • Also, remove any devices that may have ever been associated with the pool you want to recover but which are faulted. (That applies even if you think you've completely destroyed any other pools and disassociated these devices. The recovery tools will try to piece together fragments of old pools, and may read old labels/uberblocks to combine devices and data in unpredictable ways.)
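
For the cloning step in the first couple of bullets, a minimal ddrescue sketch looks something like this (device names and paths are placeholders; the destination must be at least as large as the source):

    # Clone the old mirror member to a spare disk, keeping a mapfile so the
    # copy can be resumed if interrupted (-f is required when the destination
    # is a block device):
    ddrescue -f /dev/sdb /dev/sdX sdb-clone.map

    # Alternatively, clone to an image file on another filesystem:
    ddrescue /dev/sdb /mnt/scratch/sdb-clone.img sdb-clone-img.map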

More details:

There appears to be a way to recover offline and detached drives from zpools on Linux. A user jjwhitney created a port of the labelfix utility I mentioned in the question, originally written by Jeff Bonwick (the inventor of ZFS) nearly 12 years ago. For reasons I cannot fathom, this utility has not been incorporated into ZFS builds, even though it would allow recovery of data from intact pools whose import fails for any of a number of reasons involving invalid labels. (Some discussion on the issue here.)

(Sidenote: One thing this process has led me to realize is that ZFS recovery tools are severely lacking, and no one should be using this filesystem for anything without complete, continuously maintained backups. And don't depend on that old mirrored drive sitting in your closet to be a last-chance backup unless you're sure it's importable. ZFS is apparently great at maintaining data integrity when everything cooperates, but it's incredibly fragile when something goes wrong. When it breaks, or you do something minor but stupid, all of your data can simply become inaccessible and unreadable, even though it's intact.)

In any case, the labelfix utility hasn't been updated in 5 years, so it doesn't compile against modern ZFS library files. Luckily, I still had the original OS version installed and could boot into it, then download an old ZFS on Linux source tarball and use that to get the appropriate ZFS libraries and a build environment on a system where it all still works. (I started tweaking the labelfix utility to work with the modern ZFS libraries, but that seemed a little dangerous given how little I understand about the internals I'd need to update to match the current codebase. It was easier to just build it on an old version.)
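
For anyone retracing this, the rough shape of what that involved is sketched below. The version numbers and paths are placeholders, and the exact build steps for the labelfix port may differ; check its README/Makefile before running anything.

    # On the old OS: unpack and build a matching ZFS on Linux source tree to
    # get the headers and libraries the labelfix port links against.
    tar xf zfs-<old-version>.tar.gz
    cd zfs-<old-version>
    ./configure && make

    # Build the labelfix port against that tree, then point it at a CLONE of
    # the old mirror member, never the original drive (check the port's usage
    # message for the exact invocation).
    cd ../labelfix
    make
    ./labelfix /dev/sdX1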

And lo, labelfix immediately and easily rewrote the label on my device to something that zpool import could at least interpret!

I should say that I used ddrescue to copy the whole original drive before attempting any of this, and I would highly recommend doing the same, because it's easy to make mistakes, as I did. The original pool I accidentally wrote over was named backup, so zdb started seeing multiple versions of the different backup pools and couldn't figure out why the metadata didn't match. I had to set vdev_validate_skip=1 on the ZFS kernel module to get the pool to import at all, but that then just imported the newer backup pool (not the one I wanted). Note that this happened even though I specified the exact path to the drive I wanted to import from: when forcing imports this way, ZFS seemed to ignore my specification completely and drew on a completely different configuration from a device that wasn't even listed in the command.
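
For reference, toggling that parameter looked roughly like this; the sysfs path is where OpenZFS normally exposes its module parameters, and the import line is illustrative:

    # Skip vdev label validation on import (dangerous; only use on clones):
    echo 1 > /sys/module/zfs/parameters/vdev_validate_skip

    # Then attempt the import, restricting the device scan:
    zpool import -d /dev/disk/by-id -f backup

    # Re-enable validation afterwards:
    echo 0 > /sys/module/zfs/parameters/vdev_validate_skip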

Luckily, I had made another clone of the drive, so I could attempt another run. However, labelfix is also smart enough to read the current drive configuration, so it picked up on the fact that I had two old drives with "corrupted data" from the first backup pool. That corruption unfortunately meant that the "fixed" label listed the pool not only as DEGRADED but also as FAULTED, and thus unimportable.

At this point I realized I simply had to unplug all the old drives and work without them in the system at all to avoid corrupting recovery attempts. Unfortunately, labelfix only seems to fix things once, so I am now on to clone #3 of this drive (which is currently copying from my first backup clone). Once that cloning process finishes, I'll run labelfix without any of the other old drives present, and hopefully I'll get a DEGRADED pool that I can then import.
