Diffstat (limited to 'genesishosting/disrec/zfsdestroycasestudy.md')
-rw-r--r-- | genesishosting/disrec/zfsdestroycasestudy.md | 64 |
1 files changed, 64 insertions, 0 deletions
diff --git a/genesishosting/disrec/zfsdestroycasestudy.md b/genesishosting/disrec/zfsdestroycasestudy.md
new file mode 100644
index 0000000..aa330ec
--- /dev/null
+++ b/genesishosting/disrec/zfsdestroycasestudy.md
@@ -0,0 +1,64 @@
+# 📛 Case Study: Why RAID Is Not a Backup
+
+## Overview
+
+On May 4, 2025, we experienced a production data loss incident involving the `nexus` dataset on `shredderv1`, a Linux RAID5 server. No hardware failed; critical files were lost when an unintended command ran against live data.
+
+This incident is a clear, real-world illustration of the maxim:
+
+> **RAID protects against hardware failure, not human error, data corruption, or bad automation.**
+
+---
+
+## 🔍 What Happened
+
+- `shredderv1` uses RAID5 for media storage.
+- The dataset `nexus/miniodata` (housing `genesisassets`, `genesislibrary`, etc.) was accidentally destroyed.
+- **No disks failed.** The failure was logical, not physical.
+
+---
+
+## 🔥 Impact
+
+- StationPlaylist (SPL) lost access to the Genesis media library.
+- MinIO bucket data became inaccessible immediately.
+- A temporary outage followed while we scrambled to reconfigure mounts, media, and streaming.
+
+---
+
+## ✅ Recovery
+
+Thanks to our disaster recovery stack:
+
+- Nightly **rsync backups** were synced to **The Vault** (backup server).
+- **ZFS snapshots** existed on The Vault for the affected datasets.
+- We restored the latest snapshot **from The Vault back to Shredder**, effectively reversing the loss.
+- No data corruption occurred; sync validation confirmed that the restored dataset was intact.
+
+---
+
+## 🎓 Takeaway
+
+This is a live demonstration of why:
+
+- **RAID is not a backup**
+- **Snapshots without off-host replication** are not enough
+- **Real backups must be off-server and regularly tested**
+
+---
+
+## 🔐 Current Protection Measures
+
+- Production data (`genesisassets`, `genesislibrary`) is now replicated nightly to The Vault via `rsync`.
+- ZFS snapshots are validated daily via a **dry-run restore validator**.
+- Telegram alerts report the success or failure of backup verification jobs.
+- Future goal: full ZFS storage on all production servers for native snapshot support.
+
+---
+
+## 🧠 Lessons Learned
+
+- Always assume you'll delete the wrong thing eventually.
+- Snapshots are amazing, **if** they're somewhere else.
+- Automated restore testing should be part of every backup pipeline.
+
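
The "Current Protection Measures" section mentions a dry-run restore validator, and the lessons call for automated restore testing, but the case study does not describe how either works. The sketch below is one possible approach and not the validator actually used here: it hashes every file in a source tree and in a scratch directory that a backup was restored into, then reports the differences. The function names `build_manifest`, `compare_trees`, and `restore_is_valid` are hypothetical, as is the rule that extra files in the restored copy are reported but do not fail the check.

```python
import hashlib
from pathlib import Path


def build_manifest(root: Path) -> dict:
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue  # skip directories and anything that is not a regular file
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large media files are not loaded whole.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def compare_trees(source: Path, restored: Path) -> dict:
    """Report files that are missing from, extra in, or changed in `restored`."""
    src = build_manifest(source)
    dst = build_manifest(restored)
    common = set(src) & set(dst)
    return {
        "missing": sorted(set(src) - set(dst)),
        "extra": sorted(set(dst) - set(src)),
        "changed": sorted(p for p in common if src[p] != dst[p]),
    }


def restore_is_valid(report: dict) -> bool:
    """Assumed policy: a restore passes if nothing is missing or changed.

    Extra files are reported but tolerated, since a backup may still hold
    files that were deleted from the source after the backup ran.
    """
    return not report["missing"] and not report["changed"]
```

Calling `compare_trees` with the live directory and the scratch restore directory returns a report whose three lists can be passed to `restore_is_valid`; the boolean result is the kind of signal a success/failure alert could be built on.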