author: doc <doc@filenotfound.org>, 2025-06-30 20:06:28 +0000
committer: doc <doc@filenotfound.org>, 2025-06-30 20:06:28 +0000
commit: 717fcb9c81d2bc3cc7a84a3ebea6572d7ff0f5cf
tree: 7cbd6a8d5046409a82b22d34b01aac93b3e24818 /genesishosting/disrec
parent: 8368ff389ec596dee6212ebeb85e01c638364fb3

uploading documentation (HEAD, master)
Diffstat (limited to 'genesishosting/disrec')
-rw-r--r-- genesishosting/disrec/zfsdestroycasestudy.md | 64
1 files changed, 64 insertions, 0 deletions
diff --git a/genesishosting/disrec/zfsdestroycasestudy.md b/genesishosting/disrec/zfsdestroycasestudy.md
new file mode 100644
index 0000000..aa330ec
--- /dev/null
+++ b/genesishosting/disrec/zfsdestroycasestudy.md
@@ -0,0 +1,64 @@
+# 📛 Case Study: Why RAID Is Not a Backup
+
+## Overview
+
+On May 4, 2025, we experienced a production data loss incident involving the `nexus` dataset on `shredderv1`, a Linux RAID5 server. Although no hardware failed, critical files were lost when an unintended command ran against live data.
+
+This incident serves as a clear, real-world illustration of the maxim:
+
+> **RAID protects against hardware failure — not human error, data corruption, or bad automation.**
+
+---
+
+## 🔍 What Happened
+
+- `shredderv1` uses RAID5 for media storage.
+- The dataset `nexus/miniodata` (housing `genesisassets`, `genesislibrary`, etc.) was accidentally destroyed.
+- **No disks failed.** The failure was logical, not physical.
+
+---
+
+## 🔥 Impact
+
+- StationPlaylist (SPL) lost access to the Genesis media library.
+- MinIO bucket data became inaccessible immediately.
+- A temporary outage followed while we reconfigured mounts, media, and streaming.
+
+---
+
+## ✅ Recovery
+
+Thanks to our disaster recovery stack:
+
+- Nightly **rsync backups** were synced to **The Vault** (backup server).
+- **ZFS snapshots** existed on The Vault for the affected datasets.
+- We restored the latest snapshot **from The Vault back to Shredder**, effectively reversing the loss.
+- No data corruption occurred; sync validation confirmed dataset integrity.
+
+---
+
+## 🎓 Takeaway
+
+This is a live demonstration of why:
+
+- **RAID is not a backup**
+- **Snapshots without off-host replication** are not enough
+- **Real backups must be off-server and regularly tested**
+
+---
+
+## 🔐 Current Protection Measures
+
+- Production data (`genesisassets`, `genesislibrary`) is now replicated nightly to The Vault via `rsync`.
+- ZFS snapshots are validated daily via a **dry-run restore validator**.
+- Telegram alerts report the success or failure of backup verification jobs.
+- Future goal: full ZFS storage on all production servers for native snapshot support.
+
+---
+
+## 🧠 Lessons Learned
+
+- Always assume you'll delete the wrong thing eventually.
+- Snapshots are amazing — **if** they're somewhere else.
+- Automated restore testing should be part of every backup pipeline.
+
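
The case study mentions a daily dry-run restore validator but does not show it. The sketch below is one possible reading of that idea, not the validator actually used: it asks `rsync` to report, without writing anything, what restoring a snapshot directory onto a target would change, and treats an empty report as a match. The function names and the example paths are hypothetical; only the `rsync` options themselves are standard.

```python
import subprocess
from typing import List


def build_dry_run_command(snapshot_dir: str, target_dir: str) -> List[str]:
    """Build an rsync command that only reports what a restore would change."""
    return [
        "rsync",
        "--archive",          # preserve permissions, times, symlinks, etc.
        "--dry-run",          # report only; nothing is written
        "--itemize-changes",  # one status line per affected item
        "--checksum",         # compare file contents, not just size and mtime
        "--delete",           # also report files that exist only on the target
        snapshot_dir.rstrip("/") + "/",  # trailing slash: sync the contents
        target_dir.rstrip("/") + "/",
    ]


def pending_changes(itemized_output: str) -> List[str]:
    """Return itemized lines that describe a transfer, creation, or deletion.

    In rsync's itemized format a leading "." marks an item that would not be
    transferred (at most its attributes differ), so those lines are skipped.
    """
    return [
        line
        for line in itemized_output.splitlines()
        if line and not line.startswith(".")
    ]


def validate_restore(snapshot_dir: str, target_dir: str) -> bool:
    """Return True when a restore from snapshot_dir would change nothing."""
    result = subprocess.run(
        build_dry_run_command(snapshot_dir, target_dir),
        capture_output=True,
        text=True,
        check=True,  # raise if rsync itself fails
    )
    return not pending_changes(result.stdout)


# Hypothetical usage; both paths are assumptions, not taken from the case study:
# ok = validate_restore("/vault/nexus/.zfs/snapshot/nightly", "/mnt/restore-test")
```

Keeping the parsing step separate from the `subprocess` call makes the pass/fail logic testable without touching real data, and the boolean result is the natural input for whatever alerting the backup job already uses.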