# 📛 Case Study: Why RAID Is Not a Backup

## Overview

On May 4, 2025, we experienced a production data loss incident involving the `nexus` dataset on `shredderv1`, a Linux RAID5 server. No hardware failed; critical files were lost because a command was unintentionally run against live data.

This incident serves as a clear, real-world illustration of the maxim:

> **RAID protects against hardware failure — not human error, data corruption, or bad automation.**

---

## 🔍 What Happened

- `shredderv1` uses RAID5 for media storage.
- The dataset `nexus/miniodata` (housing `genesisassets`, `genesislibrary`, etc.) was accidentally destroyed.
- **No disks failed.** The failure was logical, not physical.

---

## 🔥 Impact

- StationPlaylist (SPL) lost access to the Genesis media library.
- MinIO bucket data became inaccessible immediately.
- A temporary outage followed while we scrambled to reconfigure mounts, media, and streaming.

---

## ✅ Recovery

Thanks to our disaster recovery stack:

- Nightly **rsync backups** had been copied to **The Vault** (our backup server).
- **ZFS snapshots** existed on The Vault for the affected datasets.
- We restored the latest snapshot **from The Vault back to Shredder**, effectively reversing the loss.
- No data corruption occurred; sync validation confirmed the integrity of the restored dataset.
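
For illustration, a restore of this kind can be expressed as an `rsync` pull from a snapshot directory on the backup host. The sketch below is not the exact procedure used in this incident. It assumes the snapshot is readable through ZFS's `.zfs/snapshot/<name>/` directory and is reachable over SSH; the host name `vault`, the snapshot name, and every path are hypothetical placeholders. It defaults to a dry run so the command can be reviewed before anything is written.

```python
import shlex


def build_restore_command(vault_host, dataset_mount, snapshot, target_dir, dry_run=True):
    """Build an rsync command that pulls a snapshot's contents from the backup host.

    All arguments are illustrative; none of these names come from the real setup.
    """
    # ZFS exposes read-only snapshot contents under <mountpoint>/.zfs/snapshot/<name>/
    source = f"{vault_host}:{dataset_mount.rstrip('/')}/.zfs/snapshot/{snapshot}/"
    cmd = ["rsync", "-a", "--itemize-changes"]
    if dry_run:
        # Report what would be copied without writing to the target
        cmd.append("--dry-run")
    cmd += [source, target_dir.rstrip("/") + "/"]
    return cmd


if __name__ == "__main__":
    # Hypothetical values, for demonstration only
    cmd = build_restore_command(
        "vault", "/backups/miniodata", "nightly-example", "/mnt/restore/miniodata"
    )
    print(shlex.join(cmd))
```

Building the command as a list and printing it first keeps the destructive step (running without `--dry-run`) a deliberate second action.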

---

## 🎓 Takeaway

This is a live demonstration of why:

- **RAID is not a backup**
- **Snapshots without off-host replication** are not enough
- **Real backups must be off-server and regularly tested**

---

## 🔐 Current Protection Measures

- Production data (`genesisassets`, `genesislibrary`) is now replicated nightly to The Vault via `rsync`.
- ZFS snapshots are validated daily via a **dry-run restore validator**.
- Telegram alerts report the success or failure of backup verification jobs.
- Future goal: full ZFS storage on all production servers for native snapshot support.
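
One way a dry-run validator with Telegram alerts could work is sketched below. This is an assumption-laden illustration, not the validator actually deployed: it compares a snapshot directory against a reference copy using `rsync --dry-run --itemize-changes --checksum`, counts the entries that would change, and prepares a message for the Telegram Bot API `sendMessage` method. The function names, directories, bot token, and chat ID are all hypothetical.

```python
import subprocess
import urllib.parse
import urllib.request


def count_differences(itemized_output):
    """Count entries rsync would transfer, create, or delete."""
    count = 0
    for line in itemized_output.splitlines():
        # First character of an itemized line: '<' or '>' is a transfer,
        # 'c' a local creation, 'h' a hard link, '*' a message such as
        # "*deleting". A leading '.' means the content is not being updated.
        if line and line[0] in "<>ch*":
            count += 1
    return count


def validate_restore(snapshot_dir, reference_dir):
    """Dry-run an rsync between two directories and return the difference count."""
    result = subprocess.run(
        ["rsync", "-a", "--dry-run", "--itemize-changes", "--checksum", "--delete",
         snapshot_dir.rstrip("/") + "/", reference_dir.rstrip("/") + "/"],
        capture_output=True, text=True, check=True,
    )
    return count_differences(result.stdout)


def build_alert_request(bot_token, chat_id, text):
    """Prepare a Telegram sendMessage request; nothing is sent until urlopen is called."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    return urllib.request.Request(url, data=data)


def report(differences, bot_token, chat_id):
    """Send a success or failure alert and return the HTTP status code."""
    status = "OK" if differences == 0 else f"FAILED: {differences} differing entries"
    request = build_alert_request(bot_token, chat_id, f"Backup verification {status}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A nightly job could call `validate_restore(...)` and pass the result to `report(...)`. Using `--checksum` makes the comparison slower but catches content differences that matching sizes and timestamps would hide.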

---

## 🧠 Lessons Learned

- Always assume you'll delete the wrong thing eventually.
- Snapshots are amazing — **if** they're somewhere else.
- Automated restore testing should be part of every backup pipeline.