My Proxmox Disaster Recovery Test (of the unplanned variety)

Welp, I lost another drive.

Homelab disasters are inevitable; expected, even. Recently, a failed disk took out an entire node in my Proxmox cluster. While the initial shock was significant, a solid backup/restore plan and a High Availability (HA) setup made for a pretty swift recovery.

When I first started playing with my homelab, I was primarily using RPis, and I lost a couple of SD cards to corruption or outright failure. That drove home the importance of regular backups, and I’ve made an effort to back up my homelab systems ever since. I now virtualize most of my servers via Proxmox and perform nightly backups to a local Proxmox Backup Server (PBS), which is then synchronized to an offsite PBS server. I’ve tested the restore function a few times, and it seemed like a fairly straightforward process.
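For anyone curious, the offsite copy in a setup like this is normally a pull-style sync job configured on the remote PBS box. The remote name, datastore names, and schedule below are placeholders rather than my actual config, but the shape of it is roughly:

    # On the offsite PBS: register the primary PBS as a remote
    # (host, auth-id, password, and fingerprint are all placeholders)
    proxmox-backup-manager remote create homelab-pbs \
        --host pbs.home.example.net \
        --auth-id sync@pbs \
        --password 'REDACTED' \
        --fingerprint '<primary PBS certificate fingerprint>'

    # Pull the primary's datastore into the offsite datastore every night
    proxmox-backup-manager sync-job create nightly-offsite \
        --remote homelab-pbs \
        --remote-store homelab \
        --store offsite \
        --schedule '02:30'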

For background, my current Proxmox cluster is made up of three Lenovo Tiny SFF PCs. Two of those PCs currently have only a single internal storage device, which holds the Proxmox boot and host partitions as well as the storage for all of the guest OSes. This means that if a disk fails, it takes out the entire node and everything running on it.

...Which is exactly what happened a couple of weeks ago. Booting into the BIOS showed that the drive had failed so hard that the system didn’t even acknowledge a drive was installed at all. The drive, by the way, was a Critical-branded NVMe that I had purchased only two months prior. That’s just enough time to be outside of Amazon’s return period, yet well short of any reasonable life expectancy… but I digress. With the drive gone, I lost that node’s Proxmox host OS and all of the VMs and containers running on it. The HA-enabled guests were automatically migrated to one of the remaining nodes, exactly as intended (yay!). The non-HA guests I had to restore manually from their PBS backups. I was quite pleased with how quick and easy restoring a guest from PBS turned out to be; everything could be done through the web GUI in just a couple of clicks. It’s obviously never fun to lose a disk, but PBS made the recovery pretty painless overall.
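For those who prefer the shell, the same restore can also be done from a PVE node’s command line. The storage name and VMIDs here are just example placeholders, but it goes roughly like this:

    # List the backups available on the PBS-backed storage (storage name is a placeholder)
    pvesm list pbs-local --content backup

    # Restore a VM from a PBS snapshot onto local storage
    qmrestore pbs-local:backup/vm/101/2024-05-20T01:30:00Z 101 --storage local-lvm

    # Containers use pct restore instead
    pct restore 202 pbs-local:backup/ct/202/2024-05-20T01:35:00Z --storage local-lvm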


With the guests back up and running again, I removed the failed node from the cluster and purged its remaining config files from /etc/pve/nodes.
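The node name here is a placeholder, but the removal, run from one of the surviving nodes, looks roughly like this:

    # From a healthy node: confirm the cluster still lists the dead member, then remove it
    pvecm status
    pvecm delnode pve3          # 'pve3' is a placeholder for the failed node's name

    # Clean up the leftover node directory in the cluster filesystem
    rm -r /etc/pve/nodes/pve3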

For the failed node itself, I had to replace the drive and then reinstall Proxmox from scratch. From there, I pointed apt at my apt-cacher-ng server, ran a quick Post Install Script, configured my network devices, and finally added the “new” node back into the cluster. The whole process took only a couple of hours (including troubleshooting and physically installing the new drive), and most of the hosted systems (such as this blog) were only offline for a handful of minutes, thanks to the High Availability setup.
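For reference, pointing apt at an apt-cacher-ng instance and rejoining the cluster only takes a couple of lines; the hostname and IP below are placeholders, not my actual addresses:

    # Point apt at the caching proxy (3142 is apt-cacher-ng's default port)
    echo 'Acquire::http::Proxy "http://apt-cacher.home.example.net:3142";' \
        > /etc/apt/apt.conf.d/01proxy

    # Join the freshly installed node to the existing cluster
    # (run on the new node, pointing at any current member's IP)
    pvecm add 192.168.1.11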

Needless to say, I was quite happy with my PBS experience.   
...And not so much my experience with Critical’s NVMe drives. 
