A Failed Data Recovery Adventure

[This is a post from my old blog posted on 2024-10-07]


For a few months I have been running my second-hand HPE ProLiant DL360 Gen9 server. I had set it up with three HDDs: two 1.2TB and one 1.8TB. The two 1.2TB disks were configured as RAID 1 and the 1.8TB one as a single-disk RAID 0. The RAID 0 volume was the boot disk for Proxmox and the RAID 1 volume was used as VM storage.

When I installed the disks, the 1.8TB disk gave a predictive failure warning, but I ignored it at the time. Looking back at it now, I should have been more careful…

Last week all my VMs were shut down and Proxmox failed to boot. Cue the investigation. It gave an error like this when trying to load the initramfs:

Error: failure reading sector 0x...... from 'hd0'

I headed into the Intelligent Provisioning software that comes with the server. After every boot the disk's status switched between failed and predicted to fail. That did not give me a lot of hope.
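For reference, the same kind of health information can usually also be pulled from a running OS with smartmontools; behind an HPE Smart Array controller the physical drives are addressed with the cciss device type. A sketch (device name and drive index are illustrative):

# Overall SMART health of a directly attached disk
smartctl -H /dev/sda

# Behind an HPE Smart Array controller, address the physical drive by index
smartctl -a -d cciss,0 /dev/sda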

First I flashed a SystemRescue USB and started the recovery process. It confused me quite a bit, since Proxmox uses LVM to manage its volumes and I had never used LVM before. This was how the devices were set up:

  • /dev/sda1: BIOS boot
  • /dev/sda2: EFI system
  • /dev/sda3: Linux LVM

I wasn't able to mount sda1 or sda3; both gave the error: wrong fs type, bad option, bad superblock. sda2 was mountable, but since it is the EFI partition it was not that interesting. I first wanted to try to generate a new initramfs, in the hope that that would fix it, but I was not sure how.
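In hindsight the mount failures make sense: sda1 is a BIOS boot partition with no filesystem on it, and sda3 is an LVM physical volume, so neither can be mounted directly. The usual way to rebuild the initramfs from a rescue system would be something like the sketch below: activate the volume group, mount and chroot into the root LV, and regenerate the initramfs. I'm writing it down from memory; I never got this far on the failing disk.

# Make the LVM volume group on sda3 visible and activate its logical volumes
vgscan
vgchange -ay pve

# Mount the Proxmox root LV and the EFI partition
mount /dev/pve/root /mnt
mount /dev/sda2 /mnt/boot/efi

# Bind-mount the kernel pseudo-filesystems and chroot in to rebuild the initramfs
for d in dev proc sys; do mount --bind /$d /mnt/$d; done
chroot /mnt update-initramfs -u -k all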

I tried a lot of tools on the SystemRescue system, but they either did not give much information or they crashed the disk. For example, I tried dd and ddrescue, but every time they reached about 8.5GB the disk controller would crash and throw a lot of errors. And since the SystemRescue USB does not have persistent storage, everything had to be restarted from scratch every time. At this point I thought: let's just try reinstalling Proxmox on some new disks and see how many VMs I can save.

I got some additional second-hand 1.8TB HDDs, plugged them in and created a new RAID 1 volume. I installed Proxmox on it and added the existing RAID 1 volume as a data store. As expected I got a few VM disks back, but only half of what I hoped for. That got me wondering where the others could be… Running lvs gave me the following overview:

root@proxmox:/mnt# lvs
  LV                              VG               Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  data                            pve              twi-a-tz--  <1.49t             0.00   0.14
  root                            pve              -wi-ao----  96.00g
  swap                            pve              -wi-ao----   8.00g
  data                            pve-OLD-B51443BC twi-aotz--  <1.49t             12.24  0.51
  root                            pve-OLD-B51443BC -wi-a-----  96.00g
  swap                            pve-OLD-B51443BC -wi-a-----   8.00g
  vm-106-disk-0                   pve-OLD-B51443BC Vwi-a-tz--  32.00g data        8.12
  vm-107-disk-0                   pve-OLD-B51443BC Vwi-a-tz--  32.00g data        11.36
  vm-108-disk-0                   pve-OLD-B51443BC Vwi-a-tz--  64.00g data        18.20
  vm-108-state-suspend-2024-07-24 pve-OLD-B51443BC Vwi-a-tz-- <16.49g data        38.26
  vm-109-disk-0                   pve-OLD-B51443BC Vwi-a-tz--  64.00g data        100.00
  vm-110-disk-0                   pve-OLD-B51443BC Vwi-a-tz--  64.00g data        30.08
  vm-111-disk-0                   pve-OLD-B51443BC Vwi-a-tz--  64.00g data        18.43
  vm-112-disk-0                   pve-OLD-B51443BC Vwi-a-tz-- 192.00g data        32.82
  vm-113-disk-0                   pve-OLD-B51443BC Vwi-a-tz--  50.00g data        8.02
  vm-113-state-suspend-2024-08-19 pve-OLD-B51443BC Vwi-a-tz--  <8.49g data        3.98

As you can see, there is a volume group called pve which contains the current Proxmox installation, and a pve-OLD-B51443BC which contains the old installation, but also a handful of disk images I was missing. But how do I get those off the disk if I can't mount them?

Using ddrescue I copied the failed disk to a 2TB external HDD, but every few GB it crashed the /dev/sda drive and I had to reboot the server. It was slow progress… In the end it was able to recover 95.44% of the data.
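For anyone attempting the same thing: the part that makes this workable is ddrescue's mapfile, which records progress on persistent storage so every run resumes where the previous one stopped instead of starting over after a crash. A typical two-pass invocation looks something like this (device names and paths are illustrative; you can write to an image file on the external disk or straight to the device):

# First pass: copy everything that reads cleanly, skip the bad areas,
# and track progress in a mapfile so the run can resume after a crash
ddrescue -n /dev/sda /mnt/external/sda.img /mnt/external/sda.map

# Second pass: go back over the bad areas with direct disc access and a few retries
ddrescue -d -r3 /dev/sda /mnt/external/sda.img /mnt/external/sda.map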

Then I ran into the issue of having two disks with the exact same volume group, logical volume names and UUIDs: the failing disk and its copy. Let's power off the USB disk first:

root@proxmox:/mnt# udisksctl power-off -b /dev/sdf

Rename the volume group on the old disk (addressed by its UUID):

root@proxmox:/mnt# vgrename -v mIoiwe-0mNK-8udf-BMxg-2LdO-hZXI-TvOnjz pve-OLD
  Processing VG pve-OLD-B51443BC because of matching UUID mIoiwe-0mNK-8udf-BMxg-2LdO-hZXI-TvOnjz
  Writing out updated volume group
  Archiving volume group "pve-OLD-B51443BC" metadata (seqno 57).
  Creating volume group backup "/etc/lvm/backup/pve-OLD" (seqno 58).
  Volume group "mIoiwe-0mNK-8udf-BMxg-2LdO-hZXI-TvOnjz" successfully renamed to "pve-OLD"

Then generate a new UUID for the pve-OLD VG:

root@proxmox:~# vgchange --uuid pve-OLD

And generate a new UUID for the PV; the volume group has to be deactivated first:

root@proxmox:~# vgchange -an pve-OLD
  0 logical volume(s) in volume group "pve-OLD" now active
root@proxmox:~# pvchange --uuid /dev/sda3
  Physical volume "/dev/sda3" changed
  1 physical volume changed / 0 physical volumes not changed

Reboot the server. After that, all three volume groups finally show up:

root@proxmox:~# pvs
  PV         VG               Fmt  Attr PSize  PFree
  /dev/sda3  pve-OLD          lvm2 a--  <1.64t 16.00g
  /dev/sdc3  pve              lvm2 a--  <1.64t 16.00g
  /dev/sde1  pve-OLD-B51443BC lvm2 a--  <1.64t 16.00g
root@proxmox:~# lvs
  LV                              VG               Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  data                            pve              twi-a-tz--  <1.49t             0.00   0.14
  root                            pve              -wi-ao----  96.00g
  swap                            pve              -wi-ao----   8.00g
  data                            pve-OLD          twi-aotz--  <1.49t             12.24  0.51
  root                            pve-OLD          -wi-a-----  96.00g
  swap                            pve-OLD          -wi-a-----   8.00g
  vm-106-disk-0                   pve-OLD          Vwi-a-tz--  32.00g data        8.12
  vm-107-disk-0                   pve-OLD          Vwi-a-tz--  32.00g data        11.36
  vm-108-disk-0                   pve-OLD          Vwi-a-tz--  64.00g data        18.20
  vm-108-state-suspend-2024-07-24 pve-OLD          Vwi-a-tz-- <16.49g data        38.26
  vm-109-disk-0                   pve-OLD          Vwi-a-tz--  64.00g data        100.00
  vm-110-disk-0                   pve-OLD          Vwi-a-tz--  64.00g data        30.08
  vm-111-disk-0                   pve-OLD          Vwi-a-tz--  64.00g data        18.43
  vm-112-disk-0                   pve-OLD          Vwi-a-tz-- 192.00g data        32.82
  vm-113-disk-0                   pve-OLD          Vwi-a-tz--  50.00g data        8.02
  vm-113-state-suspend-2024-08-19 pve-OLD          Vwi-a-tz--  <8.49g data        3.98
  data                            pve-OLD-B51443BC twi---tz--  <1.49t
  root                            pve-OLD-B51443BC -wi-a-----  96.00g
  swap                            pve-OLD-B51443BC -wi-a-----   8.00g
  vm-106-disk-0                   pve-OLD-B51443BC Vwi---tz--  32.00g data
  vm-107-disk-0                   pve-OLD-B51443BC Vwi---tz--  32.00g data
  vm-108-disk-0                   pve-OLD-B51443BC Vwi---tz--  64.00g data
  vm-108-state-suspend-2024-07-24 pve-OLD-B51443BC Vwi---tz-- <16.49g data
  vm-109-disk-0                   pve-OLD-B51443BC Vwi---tz--  64.00g data
  vm-110-disk-0                   pve-OLD-B51443BC Vwi---tz--  64.00g data
  vm-111-disk-0                   pve-OLD-B51443BC Vwi---tz--  64.00g data
  vm-112-disk-0                   pve-OLD-B51443BC Vwi---tz-- 192.00g data
  vm-113-disk-0                   pve-OLD-B51443BC Vwi---tz--  50.00g data
  vm-113-state-suspend-2024-08-19 pve-OLD-B51443BC Vwi---tz--  <8.49g data

But I am still unable to mount the volumes…

root@proxmox:/mnt# mount /dev/pve-OLD/vm-106-disk-0 test -o ro,user
mount: /mnt/test: wrong fs type, bad option, bad superblock on /dev/mapper/pve--OLD-vm--106--disk--0, missing codepage or helper program, or other error.
       dmesg(1) may have more information after failed mount system call.
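One thing that might still be worth trying (I haven't gotten around to it): as far as I understand, the vm-*-disk-* volumes are whole-disk images of the guests, so they carry their own partition table and can't be mounted directly; the partitions inside would first have to be mapped, roughly like this (device names illustrative, and with only 95.44% of the data recovered it may of course still fail):

# Check what is actually on the LV: a bare filesystem or a partitioned disk image
file -s /dev/pve-OLD/vm-106-disk-0

# If it is a partitioned image, expose the partitions through a loop device...
losetup --find --show --partscan /dev/pve-OLD/vm-106-disk-0
# ...which prints e.g. /dev/loop0; the partitions then show up as /dev/loop0p1, /dev/loop0p2, ...
mount -o ro /dev/loop0p1 /mnt/test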

Beyond that, I have no clue what else I could try. There isn't any important data lost, and it was a fun rabbit hole to go down. If there are any other things I could try, let me know by mailing me!