A Failed Data Recovery Adventure
[This is a post from my old blog posted on 2024-10-07]
For a few months I have been running my second-hand HPE ProLiant DL360 Gen9 server. I had it set up with three HDDs: two 1.2TB drives and one 1.8TB drive. The two 1.2TB disks were set up as a raid1 and the 1.8TB one as a single-disk raid0. The raid0 was used as the boot disk for Proxmox and the raid1 as the VM storage.
When I installed the disks, the 1.8TB disk gave a predictive failure warning, but I ignored it at the time. Looking back now, I should have been more careful…
Last week all my VMs were shut down and Proxmox failed to boot. Cue the investigation. When trying to load the initramfs, it gave an error like:
Error: failure reading sector 0x...... from 'hd0'
I headed into the Intelligent Provisioning software that comes with the server. On every boot it switched between reporting the disk as failed and as predicted to fail. That did not give me a lot of hope.
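(For what it's worth, SMART data for drives behind the Smart Array controller can also be read from a running Linux system with smartmontools, using the cciss device type; the drive index below is just an example:)

# SMART data for the first physical drive behind the Smart Array controller (the index depends on the bay)
smartctl -a -d cciss,0 /dev/sda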
First I flashed a SystemRescue USB and started the recovery process. It confused me quite a bit, since Proxmox uses LVM to manage its volumes and I had never used LVM before. This is how the devices were set up:
/dev/sda1: BIOS boot
/dev/sda2: EFI system
/dev/sda3: Linux LVM
I wasn’t able to mount sda1 or sda3; both gave the error: wrong fs type, bad option, bad superblock. sda2 was mountable, but since it is just the EFI partition it was not that interesting. I first wanted to try generating a new initramfs, in the hope that it would fix the boot, but I was not sure how.
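In hindsight those mount failures make sense: sda1 is the BIOS boot partition, which has no filesystem on it, and sda3 is an LVM physical volume, so neither can be mounted directly; the logical volumes inside sda3 have to be activated first. For reference, regenerating the initramfs from the rescue system would have looked roughly like this, assuming the root LV was still readable (which, given the state of the disk, it probably wasn't):

# activate the volume group so the logical volumes appear under /dev/pve
vgscan
vgchange -ay pve

# mount the root LV and the EFI partition, then chroot into the installation
mount /dev/pve/root /mnt
mount /dev/sda2 /mnt/boot/efi
for d in dev proc sys; do mount --bind /$d /mnt/$d; done

# rebuild the initramfs for all installed kernels
chroot /mnt update-initramfs -u -k all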
I tried a lot of tools on the SystemRescue system, but they either did not give a lot of information or they crashed the disk. For example, I tried to run dd and ddrescue, but every time it reached 8.5GB the disk controller would crash and cause a lot of errors. And since the SystemRescue USB does not have persistent storage, the whole process had to be restarted from scratch every time. At this point I thought: let's just try reinstalling Proxmox on some new disks and see how many VMs I can save.
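(For the record: ddrescue can resume where it left off as long as its mapfile lives on storage that survives a reboot, so the lack of persistence on the USB stick itself isn't fatal. A rough sketch, with made-up device names and paths:)

# mount any writable disk so the image and the mapfile survive a reboot
mkdir -p /mnt/rescue
mount /dev/sdX1 /mnt/rescue

# first pass: copy everything that reads cleanly and skip the bad areas;
# rerunning the exact same command after a crash resumes from the mapfile
ddrescue -d -n /dev/sda /mnt/rescue/sda.img /mnt/rescue/sda.map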
I got some new second-hand 1.8TB HDDs, plugged them in and created a new raid1 volume. I installed Proxmox on it and added the other raid1 volume as the data store. As expected I got a few VM disks back, but only half of what I hoped for. That got me thinking about where the others could be… Running lvs gave me the following overview:
root@proxmox:/mnt# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
data pve twi-a-tz-- <1.49t 0.00 0.14
root pve -wi-ao---- 96.00g
swap pve -wi-ao---- 8.00g
data pve-OLD-B51443BC twi-aotz-- <1.49t 12.24 0.51
root pve-OLD-B51443BC -wi-a----- 96.00g
swap pve-OLD-B51443BC -wi-a----- 8.00g
vm-106-disk-0 pve-OLD-B51443BC Vwi-a-tz-- 32.00g data 8.12
vm-107-disk-0 pve-OLD-B51443BC Vwi-a-tz-- 32.00g data 11.36
vm-108-disk-0 pve-OLD-B51443BC Vwi-a-tz-- 64.00g data 18.20
vm-108-state-suspend-2024-07-24 pve-OLD-B51443BC Vwi-a-tz-- <16.49g data 38.26
vm-109-disk-0 pve-OLD-B51443BC Vwi-a-tz-- 64.00g data 100.00
vm-110-disk-0 pve-OLD-B51443BC Vwi-a-tz-- 64.00g data 30.08
vm-111-disk-0 pve-OLD-B51443BC Vwi-a-tz-- 64.00g data 18.43
vm-112-disk-0 pve-OLD-B51443BC Vwi-a-tz-- 192.00g data 32.82
vm-113-disk-0 pve-OLD-B51443BC Vwi-a-tz-- 50.00g data 8.02
vm-113-state-suspend-2024-08-19 pve-OLD-B51443BC Vwi-a-tz-- <8.49g data 3.98
As you can see, we have a VG called pve, which contains the current Proxmox installation, and a VG called pve-OLD-B51443BC, which contains the old installation but also the handful of disk images I was missing. But how do I get these off if I can't mount them?
Using ddrescue I tried to copy the failing disk to a 2TB external HDD, but every few GB it crashed the /dev/sda drive and I had to reboot the server. It was slow progress… In the end I was able to recover 95.44% of the data.
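Sticking with the made-up paths from the sketch above, a retry pass over the remaining bad areas and a summary of the mapfile would look something like this:

# retry the unreadable areas a few more times before giving up on them
ddrescue -d -r3 /dev/sda /mnt/rescue/sda.img /mnt/rescue/sda.map

# show how much of the disk has been rescued and how much is still bad
ddrescuelog -t /mnt/rescue/sda.map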
Then I ran into the issue of two disks having the same volume group names and UUIDs, since the copy is an exact clone of the original. Let's power off the USB disk first:
root@proxmox:/mnt# udisksctl power-off -b /dev/sdf
Then rename the VG on the old disk, addressing it by its UUID:
root@proxmox:/mnt# vgrename -v mIoiwe-0mNK-8udf-BMxg-2LdO-hZXI-TvOnjz pve-OLD
Processing VG pve-OLD-B51443BC because of matching UUID mIoiwe-0mNK-8udf-BMxg-2LdO-hZXI-TvOnjz
Writing out updated volume group
Archiving volume group "pve-OLD-B51443BC" metadata (seqno 57).
Creating volume group backup "/etc/lvm/backup/pve-OLD" (seqno 58).
Volume group "mIoiwe-0mNK-8udf-BMxg-2LdO-hZXI-TvOnjz" successfully renamed to "pve-OLD"
Then generate a new UUID for the pve-OLD VG:
root@proxmox:~# vgchange --uuid pve-OLD
Deactivate the VG and generate a new UUID for the PV as well:
root@proxmox:~# vgchange -an pve-OLD
0 logical volume(s) in volume group "pve-OLD" now active
root@proxmox:~# pvchange --uuid /dev/sda3
Physical volume "/dev/sda3" changed
1 physical volume changed / 0 physical volumes not changed
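(To double-check that the renamed VG and its PV really got new UUIDs, the UUID columns can be printed explicitly:)

# print the VG and PV UUIDs so duplicates are easy to spot
vgs -o vg_name,vg_uuid
pvs -o pv_name,pv_uuid,vg_name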
Then reboot the server. After that we finally have all three volume groups:
root@proxmox:~# pvs
PV VG Fmt Attr PSize PFree
/dev/sda3 pve-OLD lvm2 a-- <1.64t 16.00g
/dev/sdc3 pve lvm2 a-- <1.64t 16.00g
/dev/sde1 pve-OLD-B51443BC lvm2 a-- <1.64t 16.00g
root@proxmox:~# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
data pve twi-a-tz-- <1.49t 0.00 0.14
root pve -wi-ao---- 96.00g
swap pve -wi-ao---- 8.00g
data pve-OLD twi-aotz-- <1.49t 12.24 0.51
root pve-OLD -wi-a----- 96.00g
swap pve-OLD -wi-a----- 8.00g
vm-106-disk-0 pve-OLD Vwi-a-tz-- 32.00g data 8.12
vm-107-disk-0 pve-OLD Vwi-a-tz-- 32.00g data 11.36
vm-108-disk-0 pve-OLD Vwi-a-tz-- 64.00g data 18.20
vm-108-state-suspend-2024-07-24 pve-OLD Vwi-a-tz-- <16.49g data 38.26
vm-109-disk-0 pve-OLD Vwi-a-tz-- 64.00g data 100.00
vm-110-disk-0 pve-OLD Vwi-a-tz-- 64.00g data 30.08
vm-111-disk-0 pve-OLD Vwi-a-tz-- 64.00g data 18.43
vm-112-disk-0 pve-OLD Vwi-a-tz-- 192.00g data 32.82
vm-113-disk-0 pve-OLD Vwi-a-tz-- 50.00g data 8.02
vm-113-state-suspend-2024-08-19 pve-OLD Vwi-a-tz-- <8.49g data 3.98
data pve-OLD-B51443BC twi---tz-- <1.49t
root pve-OLD-B51443BC -wi-a----- 96.00g
swap pve-OLD-B51443BC -wi-a----- 8.00g
vm-106-disk-0 pve-OLD-B51443BC Vwi---tz-- 32.00g data
vm-107-disk-0 pve-OLD-B51443BC Vwi---tz-- 32.00g data
vm-108-disk-0 pve-OLD-B51443BC Vwi---tz-- 64.00g data
vm-108-state-suspend-2024-07-24 pve-OLD-B51443BC Vwi---tz-- <16.49g data
vm-109-disk-0 pve-OLD-B51443BC Vwi---tz-- 64.00g data
vm-110-disk-0 pve-OLD-B51443BC Vwi---tz-- 64.00g data
vm-111-disk-0 pve-OLD-B51443BC Vwi---tz-- 64.00g data
vm-112-disk-0 pve-OLD-B51443BC Vwi---tz-- 192.00g data
vm-113-disk-0 pve-OLD-B51443BC Vwi---tz-- 50.00g data
vm-113-state-suspend-2024-08-19 pve-OLD-B51443BC Vwi---tz-- <8.49g data
But I am still unable to mount the volumes…
root@proxmox:/mnt# mount /dev/pve-OLD/vm-106-disk-0 test -o ro,user
mount: /mnt/test: wrong fs type, bad option, bad superblock on /dev/mapper/pve--OLD-vm--106--disk--0, missing codepage or helper program, or other error.
dmesg(1) may have more information after failed mount system call.
At this point I have no clue what else I could try. No important data was lost, and it was a fun rabbit hole to go down. If there is anything else I could try, let me know by emailing me!
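One last guess, for whenever I (or someone else) revisit this: the thin LVs are most likely whole-disk images for the guests, complete with their own partition tables, so there is no filesystem at offset 0 for mount to find. If that's the case, the partitions inside could be exposed through a loop device; a rough sketch I haven't tested against this data:

# map the guest disk image to a loop device, read-only, and scan its partition table
losetup --find --show --read-only -P /dev/pve-OLD/vm-106-disk-0
# suppose that prints /dev/loop0; its partitions then show up as /dev/loop0p1, /dev/loop0p2, ...
file -s /dev/loop0p1
mount -o ro /dev/loop0p1 /mnt/test

Of course, with only 95.44% of the original disk recovered, the filesystems inside may simply be too damaged to mount at all.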