I recently made a huge mistake. My self-hosted setup is more of a production environment than something I do for fun. The old Dell PowerEdge in my basement stores and serves tons of important data, or at least data that is important to me and my family: documents, photos, movies, etc. It's all there in that big black box.
A few weeks ago, I decided to migrate from Hyper-V to Proxmox VE (PVE). Hyper-V Server 2019 is out of mainstream support and I'm trying to aggressively reduce my dependence on Microsoft. The migration was a little time-consuming but overall went off without a hitch.
I had been using Veeam for backups but Veeam's Proxmox support is kind of "meh" and it made sense to move to Proxmox Backup Server (PBS) since I was already using their virtualization system. My server uses hardware RAID and has two virtual disk arrays: one for VM virtual disk storage and one for backup storage. Previously, Veeam was dumping backups to the backup storage array and copying them to S3 storage offsite. I should note that storing backups on the same host being backed up is not advisable. However, sometimes you have to make compromises, especially if you want to keep costs down, and I figured that as long as I stayed on top of the offsite replications, I would be fine in the event of a major hardware failure.
With the migration to Proxmox, the plan was to offload the backups to a PBS physical server on-site which would then replicate those to another PBS host in the cloud. There were some problems with the new on-site PBS server which left me looking for a stop-gap solution.
Here's where the problems started. Proxmox VE can back up to storage without the need for PBS. I started doing that just so I had some sort of backups. I quickly learned that PBS can replicate storage from other PBS servers. It cannot, however, replicate storage from Proxmox VE. I thought, "Ok. I'll just spin up a PBS VM and dump backups to the backup disk array like I was doing with Veeam."
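For anyone wanting the same stop-gap: PVE's built-in backups are just vzdump under the hood, runnable from the GUI (Datacenter → Backup) or the shell. A minimal sketch; the VMID `100` and storage name `backup-array` are placeholders for my setup, not real identifiers:

```shell
# One-off backup of VM 100 straight from the PVE host, no PBS involved.
# "backup-array" stands in for whatever storage is defined in
# /etc/pve/storage.cfg with the "backup" content type enabled.
vzdump 100 --storage backup-array --mode snapshot --compress zstd

# Or back up every guest on the host in one go:
vzdump --all 1 --storage backup-array --mode snapshot --compress zstd
```

Snapshot mode keeps the VMs running during the backup; `zstd` keeps the archives reasonably small.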
Hyper-V has a very straightforward process for giving VMs direct access to physical disks. It's doable in Proxmox VE (which is built on Debian) but less straightforward. I spun up my PBS VM, unmounted the backup disk array from the PVE host, and assigned it as mapped storage to the new PBS VM. ...or at least I thought that's what I did.
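For reference, the passthrough itself is a one-liner on the PVE host; identifying the right disk is the part that bit me. A hedged sketch, where the VMID `101` and the serial in the by-id path are made-up placeholders:

```shell
# Pass a whole physical disk through to VM 101 as a SCSI device.
# Always use the stable /dev/disk/by-id/ path rather than /dev/sdX,
# since sdX names can change between reboots.
qm set 101 -scsi1 /dev/disk/by-id/ata-EXAMPLE_SERIAL_12345

# Confirm the mapping actually points at the disk you intended:
qm config 101 | grep scsi1
```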
I got everything configured and started running local backups which ran like complete and utter shit. I thought, "Huh. That's strange. Oh well, it's temporary anyways." and went on with my day. About two days later, I go to access Paperless-ngx and it won't come up. I check the VM console. VM is frozen. I hard reset it aaaannnnddd now it won't boot. I start digging into it and find that the virtual HDD is corrupt. fsck is unable to repair it and I'm scratching my head trying to figure out what is going on.
I continued investigating until I noticed something. The physical disk ID that's mapped to the PBS VM is the same as the ID of the host VM storage disk. At that point, I realize just how fucked I actually am. The host server and the PBS VM have been trying to write to the same disk array for the better part of two days. There's a solid chance that the entire disk is corrupt and unrecoverable. VM data, backups, all of it. I'm sweating bullets because there are tons of important documents, pictures of my kids, and other stuff in there that I can't afford to lose.
Half a day working the physical disk over with various data recovery tools confirmed my worst fears: Everything on it is gone. Completely corrupted and unreadable.
Then I caught a break. After I initially unmounted the [correct] backup array from PVE, it had just been sitting there untouched. Every once in a great while, my incompetence works out to my advantage, I guess. All the backups that were created directly from PVE, without PBS, were still intact. A few days old at this point but still way better than nothing. As I write this, I'm waiting on the last restore to finish. I managed to successfully restore all the other VMs.
What's really bad about this is that I'm a veteran. I've been in IT in some form for almost 20 years. I know better. Making mistakes is OK and is just part of learning. You have to plan for the fact that you WILL make mistakes and systems WILL fail. If you don't, you might find yourself up shit creek without a paddle.
So what did I do wrong in this situation?
- First, I failed to adequately plan ahead. I knew there were risks involved but I failed to appreciate the seriousness of those risks, much less mitigate them. What I should have done was buy a high-capacity external drive and use it to make absolutely sure I had a known good backup of everything, stored separately from my server. My inner cheapskate talked me out of it. That was a mistake.
- Second, I failed to verify, verify, verify, and verify again that I was using the correct disk ID. I already said this once but I'll repeat it: storing backups on the host being backed up is ill-advised. In an enterprise environment, it would be completely unacceptable. With self-hosting, it's understandable, especially given that redundancy is expensive. If you are storing backups on the server being backed up, even if it's on removable storage, you need to make sure you have a redundant offsite backup and that it is fully functional.
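That "verify, verify, verify" step is cheap. Before mapping a disk to a VM, a sanity check like this would have caught my mistake (the by-id path is a made-up placeholder; run these on the PVE host):

```shell
# List the stable disk IDs and which kernel device each one points to:
ls -l /dev/disk/by-id/

# Cross-check size, serial number, and current mounts so you can be
# certain which array is which before handing one to a VM:
lsblk -o NAME,SIZE,SERIAL,MOUNTPOINT

# Make sure the host itself isn't still mounting the disk you're about
# to pass through (no output means it isn't mounted):
findmnt --source /dev/disk/by-id/ata-EXAMPLE_SERIAL_12345
```

Matching the serial number from `lsblk` against the physical drive or the RAID controller's view is the closest thing to proof that you grabbed the right array.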
Luck is not a disaster recovery plan. That was a close call for me. Way too close for my comfort.