LVM not coming up after reboot, couldn’t find device with uuid

data-recovery hard-disk lvm

Had a VM that was, up until recently, working without issue, but it needed to be rebooted after some configuration changes. However, after rebooting, the VM didn't come back up, complaining that it couldn't find the root device (which is an LVM volume under /dev/mapper).

Booting into recovery mode, I saw that the device nodes under /dev/mapper and /dev/dm-* did indeed not exist.

The disk should be laid out as follows; a read-only way to confirm this from the recovery shell is sketched after the list:

  • /dev/sda1 as the boot partition
  • /dev/sda2 as an extended partition containing /dev/sda5 and /dev/sda6 (the LVM partitions)
  • /dev/sda{5,6} are both PVs in a single VG
  • the VG holds two LVs, for the root FS and swap
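
For reference, a read-only way to double-check that layout from the recovery shell (on a minimal busybox shell the LVM tools may need to be invoked as lvm pvs, lvm vgs, and so on):

fdisk -l /dev/sda   # sda1 (boot), sda2 (extended), sda5 and sda6 with type 8e (Linux LVM)
pvs                 # both /dev/sda5 and /dev/sda6 should show up as PVs in the one VG
vgs                 # the single VG spanning both PVs
lvs                 # the two LVs, root and swap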

Doing an lvm pvdisplay gives me:

  Couldn't find device with uuid '8x38hf-mzd7-xTes-y6IV-xRMr-qrNP-0dNnLi'.
  Couldn't find device with uuid '8x38hf-mzd7-xTes-y6IV-xRMr-qrNP-0dNnLi'.
  Couldn't find device with uuid '8x38hf-mzd7-xTes-y6IV-xRMr-qrNP-0dNnLi'.
  --- Physical volume ---
  PV Name               unknown device
  VG Name               of1-server-lucid
  PV Size               19.76 GiB / not usable 2.00 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              5058
  Free PE               0
  Allocated PE          5058
  PV UUID               8x38hf-mzd7-xTes-y6IV-xRMr-qrNP-0dNnLi

  --- Physical volume ---
  PV Name               /dev/sda6
  VG Name               of1-server-lucid
  PV Size               100.00 GiB / not usable 2.66 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              25599
  Free PE               0
  Allocated PE          25599
  PV UUID               cuhP6R-QbiO-U7ye-WvXN-ZNq5-cqUs-VVZpux

So it appears as though /dev/sda5 is no longer recognized as a PV (its entry shows up as "unknown device"), which is what's causing the errors.
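
The UUID it's complaining about can be cross-checked against the metadata archives LVM keeps automatically (the archive filename here is the one I use for the restore further down; adjust to whichever file matches the VG):

ls /etc/lvm/archive/ /etc/lvm/backup/                            # automatic metadata archives/backups
grep -A3 'pv[01] {' /etc/lvm/archive/of1-dev-server_00000.vg     # each pvN stanza holds the expected id = "..." and device = "..."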

fdisk -l:

Disk /dev/sda: 128.8 GB, 128849018880 bytes
255 heads, 63 sectors/track, 15665 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00044a6c

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          32      248832   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              32       15665   125579256+   5  Extended
/dev/sda5              32        2611    20722970   8e  Linux LVM
/dev/sda6            2612       15665   104856223+  8e  Linux LVM

So I can see the /dev/sda5 device exists, but blkid isn't reporting anything for it:

~ # blkid
/dev/sda1: UUID="d997d281-2909-41d3-a835-dba400e7ceec" TYPE="ext2" 
/dev/sda6: UUID="cuhP6R-QbiO-U7ye-WvXN-ZNq5-cqUs-VVZpux" TYPE="LVM2_member" 

After taking a snapshot of the disks, I tried recovering the PV from the archive config:

~ # pvremove -ff /dev/sda5
Labels on physical volume "/dev/sda5" successfully wiped
~ # pvcreate --uuid=8x38hf-mzd7-xTes-y6IV-xRMr-qrNP-0dNnLi /dev/sda5 --restorefile=/etc/lvm/archive/of1-dev-server_00000.vg
Couldn't find device with uuid '8x38hf-mzd7-xTes-y6IV-xRMr-qrNP-0dNnLi'.
  Physical volume "/dev/sda5" successfully created
~ # vgchange -a y
2 logical volume(s) in volume group "of1-dev-server" now active
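
(For what it's worth, the restore procedure in the LVM documentation also writes the archived VG metadata back with vgcfgrestore before activating anything. A sketch of that sequence, reusing the same archive file and the VG name from the output above:)

pvcreate --uuid 8x38hf-mzd7-xTes-y6IV-xRMr-qrNP-0dNnLi \
         --restorefile /etc/lvm/archive/of1-dev-server_00000.vg /dev/sda5
vgcfgrestore -f /etc/lvm/archive/of1-dev-server_00000.vg of1-dev-server   # write the archived metadata back onto the new PV
vgchange -ay of1-dev-server                                               # only then activate the VG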

So at least now the device shows up in blkid:

/dev/sda1: UUID="d997d281-2909-41d3-a835-dba400e7ceec" TYPE="ext2" 
/dev/sda6: UUID="cuhP6R-QbiO-U7ye-WvXN-ZNq5-cqUs-VVZpux" TYPE="LVM2_member" 
/dev/sda5: UUID="8x38hf-mzd7-xTes-y6IV-xRMr-qrNP-0dNnLi" TYPE="LVM2_member" 

Doing a pvdisplay now also shows the correct device:

  --- Physical volume ---
  PV Name               /dev/sda5
  VG Name               of1-dev-danr-lucid
  PV Size               19.76 GiB / not usable 2.00 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              5058
  Free PE               0
  Allocated PE          5058
  PV UUID               8x38hf-mzd7-xTes-y6IV-xRMr-qrNP-0dNnLi

  --- Physical volume ---
  PV Name               /dev/sda6
  VG Name               of1-dev-danr-lucid
  PV Size               100.00 GiB / not usable 2.66 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              25599
  Free PE               0
  Allocated PE          25599
  PV UUID               cuhP6R-QbiO-U7ye-WvXN-ZNq5-cqUs-VVZpux

And the mapper devices exist:

crw-rw----    1 root     root      10,  59 Jul 10 10:47 control
brw-rw----    1 root     root     252,   0 Jul 10 11:21 of1--dev--server-root
brw-rw----    1 root     root     252,   1 Jul 10 11:21 of1--dev--server-swap_1

The LVs also seem to be listed correctly:

~ # lvdisplay
  --- Logical volume ---
  LV Name                /dev/of1-dev-danr-lucid/root
  VG Name                of1-dev-danr-lucid
  LV UUID                pioKjE-iJEp-Uf9S-0MxQ-UR0H-cG9m-5mLJm7
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                118.89 GiB
  Current LE             30435
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:0

  --- Logical volume ---
  LV Name                /dev/of1-dev-danr-lucid/swap_1
  VG Name                of1-dev-danr-lucid
  LV UUID                mIq22L-RHi4-tudV-G6nP-T1e6-UQcS-B9hYUF
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                888.00 MiB
  Current LE             222
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:1
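
(Another read-only sanity check, just as a sketch: dmsetup can show whether the mapper devices actually point at the expected extents on sda5 and sda6.)

dmsetup table                        # one line per LV segment: start, length, target, backing device:offset
dmsetup info of1--dev--server-root   # open count, state and major:minor for the root LV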

But trying to mount the root device gives me an error:

~ # mount /dev/mapper/of1--dev--server-root /mnt2
mount: mounting /dev/mapper/of1--dev--server-root on /mnt2 failed: Invalid argument
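
(Again only as a sketch, a few read-only checks that can narrow down an "Invalid argument" from mount:)

dmesg | tail                                  # the kernel usually logs a more specific ext4 error
file -s /dev/mapper/of1--dev--server-root     # what, if anything, is at the start of the LV
blkid /dev/mapper/of1--dev--server-root       # does it still identify as an ext4 filesystem?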

So I tried a disk consistency check:

~ # fsck.ext4 -f /dev/mapper/of1--dev--server-root
e2fsck: Superblock invalid, trying backup blocks...
e2fsck: Bad magic number in super-block while trying to open /dev/mapper/of1--dev--server-root
[...]

So I used mke2fs -n (which only prints what it would do, without creating a filesystem) to find the backup superblock locations, and then ran fsck against one of them:

~ # mke2fs -n /dev/mapper/of1--dev--server-root
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
7798784 inodes, 31165440 blocks
1558272 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
952 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
        4096000, 7962624, 11239424, 20480000, 23887872
~ # fsck.ext4 -y -b 23887872 /dev/mapper/of1--dev--server-root

Upon which I received a ridiculous number of errors; the main ones I saw were:

  • Superblock has an invalid journal
  • One or more block group descriptor checksums are invalid.
  • Truncating orphaned inode ()
  • Already cleared block #0 () found in orphaned inode
  • /dev/mapper/of1--dev--server-root contains a filesystem with errors, check forced
  • Resize inode not valid. Recreate
  • Root inode is not a directory.
  • Reserved inode 3 () has invalid mode
  • HTREE directory inode has invalid root node
  • Inode , i_blocks is , should be 0.
  • Unconnected directory inode

After a lot of messages it said it was done. Mounting the device as above now works, but the filesystem is empty apart from a lost+found directory full of files; most are just numbers, though some have filenames vaguely relating to files that once existed.
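
(A rough way to see what e2fsck actually salvaged, assuming the LV is still mounted on /mnt2 as above:)

ls /mnt2/lost+found | wc -l                                                        # how many orphans were recovered
file /mnt2/lost+found/* | awk -F: '{print $2}' | sort | uniq -c | sort -rn | head  # rough breakdown by file type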

So, how do I bring the VM back up?

Whenever I see disk errors, my first instinct is to snapshot so things don't get worse, so I have a snapshot from just after the reboot, when I first saw the error.

I know the data is there somewhere, as the VM worked without issue until I rebooted it. The user can't remember changing anything on the filesystem recently, but it had almost a year of uptime when I rebooted it, so all sorts could have happened in that time.

We also, unfortunately, don't have backups as Puppet had been disabled on this node.

The original OS was Ubuntu Lucid, running on VMware.

Best Answer

If I understood correctly, you have already fixed the volume, even though you ended up with a lost+found directory that may or may not contain your critical files.

What is going on now that's blocking the VM from booting? It still can't find the boot device?

Your fdisk -l output seems a bit off to me. Have you considered the possibility that only the partition table was damaged? In this scenario, your snapshot may be helpful, and in the best case you won't even need a(nother) fsck. But we'll need something to try to find the partition offsets - I've used testdisk successfully more than once.
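
Roughly something like this, working on a copy of the snapshotted disk, never the original (the image path is only a placeholder):

dd if=/dev/sda of=/mnt/external/sda.img bs=1M conv=noerror,sync   # image the disk first
testdisk /mnt/external/sda.img                                    # then let testdisk hunt for the partition offsets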

In the worst case scenario, if you need to scrape anything from the volume, forensic tools like PhotoRec or Autopsy/The Sleuth Kit may prove useful.
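
For example (again just a sketch, with a placeholder output directory):

photorec /log /d recovered/ /mnt/external/sda.img   # carve files by signature from the image
fls -r /dev/mapper/of1--dev--server-root            # Sleuth Kit: list whatever directory entries remain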

If none of this works, give us the output of lsblk -o NAME,RM,SIZE,RO,TYPE,MAJ:MIN -fat as well (those flags are just there to show as much information as possible), along with any relevant dmesg output.
