First off, drive re-lettering just happens sometimes, depending on how your machine is set up. Drive letters haven't been guaranteed to be stable across reboots for quite a while now, so your drive moving on you is not a big concern by itself.
Assuming dmraid and device-mapper aren't using your devices:
Running mdadm --stop /dev/md0
should take care of your busy messages; that's most likely why it's complaining. Then you can try your assemble line again. If that doesn't work, --stop again, followed by an assemble with --run
(without --run, --assemble --scan won't start a degraded array). Then you can remove and re-add your failed disk to let it attempt a rebuild.
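The sequence above would look roughly like this (a sketch only; /dev/md0 and /dev/sde1 are taken from your output, so adjust to whatever your system currently shows):

```shell
# Stop the partially-assembled array to clear the "busy" errors
mdadm --stop /dev/md0

# Try a normal assemble first
mdadm --assemble --scan

# If it refuses to start because the array is degraded, stop it
# again and tell mdadm to start it anyway with the members it has
mdadm --stop /dev/md0
mdadm --assemble --scan --run

# Then remove and re-add the stale disk to trigger a rebuild
mdadm /dev/md0 --remove /dev/sde1
mdadm /dev/md0 --re-add /dev/sde1
```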
/dev/sde is outdated (look at the events counter). The others look OK at first glance, so I think you actually have a pretty good chance of no difficulties.
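You can see the events counters yourself with --examine (the sd[a-m]1 glob here is an assumption based on your device list; substitute your actual members):

```shell
# Print each member's device name and Events counter; an outdated
# disk such as sde1 will show a lower value than the rest
mdadm --examine /dev/sd[a-m]1 | egrep '/dev/sd|Events'
```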
You shouldn't zero any superblocks yet; the risk of data loss is far too high. If --run doesn't work, you'll want to find someone locally (or who can ssh in) who knows what they're doing to attempt a fix.
In response to Update 1
That "not enough to start the array" is never a good message to get from mdadm. It means mdadm has found 10 drives out of your 12-drive RAID5 array, and, as I hope you're aware, RAID5 can only survive one failure, not two.
Well, let's try and piece together what happened. First, over reboot, there was a drive letter change, which is annoying for us trying to figure it out, but mdraid doesn't care about that. Reading through your mdadm output, here is the remap that happened (sorted by the raid disk #):
00 sdh1 -> sdb1
02 sdk1 -> sde1 [OUTDATED]
03 sdg1 -> sda1
04 sdf1 -> sdm1
05 sdd1 -> sdk1
06 sdm1 -> sdg1
07 sdc1 -> sdj1
08 sdi1 -> sdc1
09 sde1 -> sdl1
10 sdj1 -> sdd1
11 sdl1 -> sdf1
13 sdb1 -> sdi1 [SPARE]
#02 has a lower 'events' counter than the others. That means it left the array at some point.
It'd be nice if you know some of the history of this array—e.g., is "12-drive RAID5, 1 hot spare" correct?
I'm not quite sure what the sequence of failures that led up to this is, though. It appears that at some point, device #1 failed, and a rebuild onto device #12 started.
But I can't make out exactly what happened next. Maybe you have logs—or an administrator to ask. Here is what I can't explain:
Somehow, #12 became #13. Somehow, #2 became #12.
That rebuild onto #12 should have finished, after which #12 would have become #1. Maybe it didn't; perhaps #2 failed during the rebuild (which would explain why it never finished), and someone tried removing and re-adding #2, which would make it #12. Removing and re-adding the spare would then make it #13. At that point, of course, you'd have had a two-disk failure, which at least makes the sequence make sense.
If this is what has happened, you've suffered a two-disk failure. That means you've lost data. What you do next depends on how important that data is (and on how good your backups are).
If the data is very valuable (and you don't have good backups), contact data recovery specialists. Otherwise:
If the data is valuable enough, you should use dd
to image all the disks involved (you can use larger disks with an image file per member to save money; 2 or 3 TB externals, for example). Then make a copy of the images and work on recovery using that copy (you can use loop devices to do this).
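A rough sketch of the imaging step (device and mount paths are examples; repeat per member disk):

```shell
# Image a member disk to a file on a larger scratch disk.
# conv=noerror,sync keeps going past read errors, padding bad
# sectors with zeros so offsets stay aligned
dd if=/dev/sdb of=/mnt/scratch/sdb.img bs=1M conv=noerror,sync

# Work on a copy of the image, never the original capture
cp /mnt/scratch/sdb.img /mnt/scratch/sdb.work.img

# Attach the working copy as a loop device for recovery attempts
losetup --find --show /mnt/scratch/sdb.work.img
```

If a disk has many read errors, ddrescue is generally a better imaging tool than plain dd, since it retries and logs bad regions.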
Obtain more spares. You probably have one dead disk, and at least a few questionable ones; smartctl
may be able to tell you more.
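For example (device names are placeholders; run this against each member):

```shell
# Overall SMART health verdict
smartctl -H /dev/sdb

# Attributes that most often indicate a failing disk
smartctl -A /dev/sdb | egrep 'Reallocated|Pending|Uncorrectable'

# Optionally start a long self-test and check results later
smartctl -t long /dev/sdb
```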
Next, add --force
to your --assemble
line. This makes mdadm use the outdated disk anyway, which means some sectors will now hold outdated data and some won't. Add in one of the new disks as a spare and let the rebuild finish. Hopefully you don't hit any bad blocks (which would cause the rebuild to fail; I believe the only answer then is to make the disk map them out). Next, run fsck -f
on the array. There will probably be errors. Once they're fixed, mount the filesystem and see what shape your data is in.
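Putting those steps together, the recovery attempt might look like this (the member glob matches the twelve devices from your output, and /dev/sdn1 is a hypothetical fresh spare; adjust both to your system):

```shell
# Force-assemble, letting mdadm use the outdated member
mdadm --stop /dev/md0
mdadm --assemble /dev/md0 --force /dev/sd[a-g]1 /dev/sd[i-m]1

# Add a fresh disk as a spare and watch the rebuild progress
mdadm /dev/md0 --add /dev/sdn1
watch cat /proc/mdstat

# Once the rebuild completes, check the filesystem before mounting
fsck -f /dev/md0
```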
Recommendations
In the future, do not build 12-disk RAID5 arrays; the probability of a two-disk failure is too high. Use RAID6 or RAID10 instead. Also, make sure to routinely scrub your arrays for bad blocks (echo check > /sys/block/md0/md/sync_action
).
Best Answer
The partition table is really just a piece of data that says things like "partition 1 spans tracks 10->99, partition 2 spans 100->599, partition 3 spans 600->16383".
If you delete the partition table then the data in tracks 10->99, 100->599, and 600->16383 is untouched; the OS just no longer knows how to find it. So if you then recreate the partition table exactly the same way, your data is still available. I made use of this in 2016 when I destroyed all my partition tables by mistake ( https://www.sweharris.org/post/2016-02-10-break-mbr/ )
If you want to delete the data inside the partitions as well then you either need to zero the whole disk, or else format the partitions. Most installers have an option to say "format partition" when you do your setup.
Or make sure your partitions start in different places (e.g., 9->100, 101->600, 601->16384) so the data inside doesn't look like a filesystem.